Abstract

In this paper, we prove decidability properties and new results on the
position of the family of languages generated by (circular) splicing systems
within the Chomsky hierarchy. The two main results of the paper are the
following. First, we show that it is decidable, given a circular splicing
language and a regular language, whether they are equal. Second, we prove
that the language generated by an alphabetic splicing system is context-free.
Alphabetic splicing systems are a generalization of simple and semi-simple
splicing systems already considered in the literature.
1. Introduction

Splicing systems were introduced by T. Head [10, 11, 12] as a model
of recombination. The basic operation is to cut words into pieces and to
reassemble the pieces in order to get another word.

There are several variants of splicing systems for circular or linear words
[12]. In this paper, we consider Paun's circular splicing, and we introduce a
new variant that we call flat splicing. In both cases, the system is described
by an initial set of words and a finite set of rules. The language generated
is the closure of the initial set under the application of splicing rules.

A splicing rule is a quadruplet of words, usually written as
$\alpha\#\beta\$\gamma\#\delta$. The words $\alpha, \beta, \gamma, \delta$
are called the handles of the rule. A rule indicates where to cut and what to
paste. More precisely, in a circular splicing system, given a rule
$\alpha\#\beta\$\gamma\#\delta$ and two circular words, the first of the form
$u\alpha \cdot \beta v$ and the second of the form $\gamma w \delta$, we cut
the first word between $\alpha$ and $\beta$, the second word between $\delta$
and $\gamma$, and stick $\alpha$ with $\gamma$ as well as $\delta$ with
$\beta$ in order to get the new circular word
$u\alpha \cdot \gamma w \delta \cdot \beta v$, see Figure 1. The case of flat
splicing systems, which involves linear words, is similar, see Figure 2. In
order to emphasize the position where to cut and the condition on what to
paste, we prefer to write $\langle\alpha|\gamma-\delta|\beta\rangle$ instead
of $\alpha\#\beta\$\gamma\#\delta$. This indicates more clearly that one word
is cut between $\alpha$ and $\beta$, and that the word to be pasted is in
$\gamma A^* \delta$.




Figure 1: Circular splicing. 

Our purpose, in introducing flat splicing systems, is to get a direct ap- 
proach to standard results in formal language theory. Circular systems are 
handled, in a second step, by full linearization. 

D. Pixton [14, 15] has considered the nature of the language generated by
a splicing system, with some assumptions about the splicing rules (symmetry,
reflexivity and self-splicing). He proves that the language generated by a
splicing system is regular (resp. context-free), provided the initial set is
regular (resp. context-free). More generally, if the initial set is in some full
AFL, then the language generated by the system is also in this full AFL.
Without the additional assumptions on the rules, it is known that one may
generate non-regular languages even with a finite initial set (R. Siromoney,
K. G. Subramanian and V. R. Dare [?]). A survey of recent developments
along these lines appears in [?].

Figure 2: Flat splicing.

In this paper, we prove decidability properties and new results on the
position of splicing systems and their languages within the Chomsky
hierarchy. We introduce a special class of splicing rules called alphabetic
rules. A rule is alphabetic if its four handles are letters or the empty word.
A splicing system is alphabetic when all its rules are alphabetic. Special
cases of alphabetic splicing systems, called simple or semi-simple systems,
have been considered in the literature [?]. In a semi-simple system, all rules
$\alpha\#\beta\$\gamma\#\delta$ satisfy the condition that the words
$\alpha\beta$ and $\gamma\delta$ are letters. In a simple system, one requires
in addition that these letters are equal, that is
$\alpha\beta = \gamma\delta$.

We show that alphabetic systems have several remarkable properties that 
do not hold for general systems. 

We consider first the problem of deciding whether the language generated
by a splicing system is regular. The problem is still open, but has been solved
in special cases [?]. Our contribution is the following (Theorem 3.1). It is
decidable, given a circular splicing language and a regular language, whether
they are equal. The corresponding inclusion problems are still open. We also
show (Remark 3.5) that it is decidable whether a given regular language is
an alphabetic splicing language. This is related to another problem that
we do not consider here, namely to give a characterization of those regular
languages that are splicing languages, or vice versa. For recent results see
[?], and for a survey see [?].

The next problem we consider concerns the comparison of the family of
splicing languages with the Chomsky hierarchy. We first prove (Theorem 4.1)
that splicing languages are always context-sensitive. Next, we prove, and
this is the main result of the paper (Theorem 5.3), that alphabetic splicing
languages are context-free. The proof of this result is in several steps.

We consider first a special class of systems called pure, and we prove
(Theorem 6.3) that pure alphabetic systems generate context-free languages,
even if the initial set is itself context-free.

We next consider another special class of systems called concatenation
systems. In those systems, insertions always take place at one end of the
word. We show (Theorem 7.1) that the language generated by a concatenation
system is context-free, even if the initial set is itself context-free.

The next step is to mix these two kinds of splicing systems. We call them 
heterogeneous systems. Every alphabetic splicing system is heterogeneous. 
The key observation, for the proof of the main result, is that in a heteroge- 
neous system, all concatenations can be executed before any proper insertion 
(Lemma 7.8). We call this a weak commutation property. The main theorem 
then easily follows. 

The relation between circular and flat alphabetic splicing systems is
described in Proposition 8.3. It follows easily that the main result also holds
for circular splicing (Theorem 8.4).

The proofs rely on so-called generalized context-free grammars, a notion 
that is rather old but seems not to be well-known. All proofs are effective. 

The paper is organized as follows. We start by introducing the new type
of splicing systems called flat splicing systems: these systems behave like 
circular systems, but operate on linear words. 

In Section 3, we prove the decidability result mentioned earlier
(Theorem 3.1). Section 4 contains the proof that splicing languages are always
context-sensitive.

Section 5 defines alphabetic splicing systems and states the main result
(Theorem 5.3), namely that the language generated by a flat or circular
alphabetic splicing system is context-free, even if the initial set is
context-free. In this section, a normalization of splicing systems called
completion is presented. The complete systems defined here are not the same
as the complete systems in [?].

Section 6 introduces pure splicing systems. Here, it is proved that the
language generated by a context-free pure splicing system is context-free.
The proof uses some results on context-free languages which are recalled in
Section 6.1.

In the next section (Section 7), we first define concatenation systems and
prove that (alphabetic) concatenation systems produce only context-free
languages. Then heterogeneous systems are defined, and the weak commutation
lemma (Lemma 7.8) is proved. This section ends with the proof of the main
theorem for flat splicing systems.

Section 8 describes the relationship between flat and circular splicing
systems and their languages. It contains the proof of the main theorem for
circular splicing systems.

The proofs that context-free alphabetic concatenation and pure systems
generate context-free languages are done by giving the grammars explicitly.
These grammars deviate from the standard form of grammars by the fact
that the sets of derivation rules may be infinite, provided they are themselves
context-free sets. It is an old result from formal language theory [?] that
generalized context-free grammars of this kind still generate context-free
languages. For the sake of completeness, we include a sketch of the proof of
this result, together with an example, in an appendix.

The results of the paper were announced by the third author in [?].



2. Definitions

2.1. Words, circular words

As usual, an alphabet $A$ is a finite set of letters. A word
$u = u_0 u_1 \cdots u_{n-1}$ is a finite sequence of letters. When it is
useful to compare words with circular words defined below, they will be
called linear words.

Two words $u$ and $v$ are conjugate, denoted by $u \sim v$, if there exist
two words $x$ and $y$ such that $u = xy$ and $v = yx$. This is an equivalence
relation. A circular word is an equivalence class of $\sim$, that is an
element of the quotient of $A^*$ by the relation $\sim$. The equivalence
class of $u$ will be denoted by ${}^{\sim}u$. We also say that ${}^{\sim}u$
is the circularization of $u$. For example,
${}^{\sim}abb = \{abb, bab, bba\}$ is a circular word. A circular word can be
viewed, as in Figure 1, as a word written on a circle. A set of circular words
is a circular language.

Let $C$ be a language of circular words. Its full linearization, denoted by
$\mathrm{Lin}(C)$, is the language
$\mathrm{Lin}(C) = \{u \in A^* \mid {}^{\sim}u \in C\}$.

Let $L$ be a language of linear words. Its circularization ${}^{\sim}L$ is
equal to ${}^{\sim}L = \{{}^{\sim}u \mid u \in L\}$.

A language $L$ of circular words is regular (resp. context-free, resp.
context-sensitive) if its full linearization is regular (resp. context-free,
resp. context-sensitive).

Let $G$ be a grammar. The language generated by $G$ will be denoted by
$L_G$. Let $S$ be a non-terminal symbol. We denote by $L_G(S)$ the language
produced by the grammar $G$ with $S$ as axiom.

2.2. Splicing systems

We start with a short description of circular splicing systems. These
systems are well known, see e.g. [?]. Then we present flat splicing systems,
which are new systems. They are of interest for proving language-theoretic
results because they allow us to separate operations on formal languages and
grammars from the operation of circular closure (circularization). It appears
that proofs for linear words are sometimes simpler because they rely directly
on standard background on formal languages.

2.2.1. Circular splicing systems

A circular splicing system is a triplet $S = (A, I, R)$, where $A$ is an
alphabet, $I$ is a set of circular words on $A$, called the initial set, and
$R$ is a finite set of splicing rules, which are quadruplets
$\langle\alpha|\gamma-\delta|\beta\rangle$ of linear words on $A$. The words
$\alpha$, $\beta$, $\gamma$ and $\delta$ are called the handles of the rule.
In the literature (see e.g. [5]), a rule is written as
$\alpha\#\beta\$\gamma\#\delta$.

If $r = \langle\alpha|\gamma-\delta|\beta\rangle$ is a splicing rule, then the
circular words ${}^{\sim}u = {}^{\sim}(\beta x \alpha)$ and
${}^{\sim}v = {}^{\sim}(\gamma y \delta)$ produce the circular word
${}^{\sim}w = {}^{\sim}(\beta x \alpha \gamma y \delta)$. We denote this
production by ${}^{\sim}u, {}^{\sim}v \vdash_r {}^{\sim}w$. The language
generated by the circular splicing system is the smallest language $C$ of
circular words containing $I$ and closed by $R$, i.e., such that for any pair
of words ${}^{\sim}u$ and ${}^{\sim}v$ in $C$ and any rule $r$ in $R$, any
circular word ${}^{\sim}w$ such that
${}^{\sim}u, {}^{\sim}v \vdash_r {}^{\sim}w$ is also in $C$. This set of
circular words is denoted by $\mathcal{C}(S)$.
A circular splicing system is finite (resp. regular, context-free,
context-sensitive) if its initial set is finite (resp. regular, context-free,
context-sensitive).

A splicing rule $r = \langle\alpha|\gamma-\delta|\beta\rangle$ is alphabetic
if its four handles $\alpha$, $\beta$, $\gamma$ and $\delta$ are letters or
the empty word. A circular splicing system is alphabetic if all its rules are
alphabetic.

Example 2.1 Let $S = (A, I, R)$ be the (finite alphabetic) circular splicing
system defined by $I = \{{}^{\sim}(ab)\}$ and
$R = \{\langle a|a-b|b\rangle\}$. It produces the context-free language
$\mathcal{C}(S) = \{{}^{\sim}(a^n b^n) \mid n \geq 1\}$.
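
The circular production rule can be tried out on Example 2.1 with a small
sketch. It rests on assumptions that are not part of the paper: a circular
word is represented by its lexicographically least rotation, a rule
$\langle\alpha|\gamma-\delta|\beta\rangle$ is stored as the tuple
(alpha, gamma, delta, beta), and the closure is cut off at a length bound.
All function names are illustrative.

```python
def canon(w):
    """Least rotation of w, used here as the representative of ~w."""
    return min((w[i:] + w[:i] for i in range(len(w))), default="")

def rotations(w):
    """All conjugates of w."""
    return {w[i:] + w[:i] for i in range(len(w))} or {""}

def circular_splice(rule, cu, cv):
    """All ~w with ~u, ~v |-_r ~w, where ~u = ~(beta x alpha), ~v = ~(gamma y delta)."""
    alpha, gamma, delta, beta = rule
    hosts = [u for u in rotations(cu)
             if len(u) >= len(alpha) + len(beta)
             and u.startswith(beta) and u.endswith(alpha)]
    guests = [v for v in rotations(cv)
              if len(v) >= len(gamma) + len(delta)
              and v.startswith(gamma) and v.endswith(delta)]
    return {canon(u + v) for u in hosts for v in guests}

def circular_closure(initial, rules, max_len):
    """Circular words of C(S) of length <= max_len (finite initial set assumed)."""
    lang = {canon(w) for w in initial if len(w) <= max_len}
    changed = True
    while changed:
        changed = False
        for u in list(lang):
            for v in list(lang):
                if len(u) + len(v) > max_len:
                    continue
                for r in rules:
                    new = circular_splice(r, u, v) - lang
                    if new:
                        lang |= new
                        changed = True
    return lang

# Example 2.1: I = {~(ab)}, R = {<a|a-b|b>}.
example_2_1 = circular_closure({"ab"}, [("a", "a", "b", "b")], 8)
print(sorted(example_2_1, key=len))
```

Because a production adds the lengths of its two arguments, cutting the
closure at a length bound loses no word below that bound.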

2.2.2. Flat splicing systems

A flat splicing system, or a splicing system for short, is a triplet
$S = (A, I, R)$, where $A$ is an alphabet, $I$ is a set of words over $A$,
called the initial set, and $R$ is a finite set of splicing rules, which are
quadruplets $\langle\alpha|\gamma-\delta|\beta\rangle$ of words over $A$.
Again, a rule is alphabetic if its four handles $\alpha$, $\beta$, $\gamma$
and $\delta$ are letters or the empty word. A splicing system is alphabetic if
all its rules are alphabetic.

Let $r = \langle\alpha|\gamma-\delta|\beta\rangle$ be a splicing rule. Given
two words $u = x\alpha \cdot \beta y$ and $v = \gamma z \delta$, applying $r$
to the pair $(u, v)$ yields the word
$w = x\alpha \cdot \gamma z \delta \cdot \beta y$. (The dots are used only to
mark the places of cutting and pasting; they are not parts of the words.) This
operation is denoted by $u, v \vdash_r w$ and is called a production. Note
that the first word (here $u$) is always the one in which the second word
(here $v$) is inserted.

Example 2.2 1. Consider the splicing rule
$r = \langle ab|aa-b|c\rangle$. We have the production
$bab \cdot cc,\ aaccb \vdash_r bab \cdot aaccb \cdot cc$.

2. Consider the splicing rule $\langle b|a-a|b\rangle$. Note that we cannot
produce the word $b \cdot a \cdot b$ from the word $b \cdot b$ and the
singleton $a$, because the rule requires that the inserted word has at least
two letters. On the contrary, the rule
$\langle b|\varepsilon-a|b\rangle$ does produce the word $bab$ from the words
$bb$ and $a$.

3. For the rule $r = \langle\varepsilon|a-a|b\rangle$, the production
$\cdot\, bbc,\ aba \vdash_r aba \cdot bbc$ is in fact a concatenation.

4. As a final example, the rule
$\langle\varepsilon|\varepsilon-\varepsilon|\varepsilon\rangle$ permits all
insertions of a word into another one.
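
The flat production can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: a rule
$\langle\alpha|\gamma-\delta|\beta\rangle$ is stored as the tuple
(alpha, gamma, delta, beta), the empty word is the empty string, and the name
`splice` is hypothetical.

```python
from typing import Set, Tuple

# A rule <alpha | gamma - delta | beta> stored as (alpha, gamma, delta, beta).
Rule = Tuple[str, str, str, str]

def splice(rule: Rule, u: str, v: str) -> Set[str]:
    """All words w such that u, v |-_r w (v is inserted into u)."""
    alpha, gamma, delta, beta = rule
    # v must factor as gamma z delta, so the two handles may not overlap in v.
    if len(v) < len(gamma) + len(delta):
        return set()
    if not (v.startswith(gamma) and v.endswith(delta)):
        return set()
    results = set()
    for i in range(len(u) + 1):
        # Cut u at position i: the left part ends with alpha,
        # the right part starts with beta.
        if u[:i].endswith(alpha) and u[i:].startswith(beta):
            results.add(u[:i] + v + u[i:])
    return results

# The four items of Example 2.2.
print(splice(("ab", "aa", "b", "c"), "babcc", "aaccb"))  # item 1
print(splice(("b", "a", "a", "b"), "bb", "a"))           # item 2, first rule
print(splice(("b", "", "a", "b"), "bb", "a"))            # item 2, second rule
print(splice(("", "a", "a", "b"), "bbc", "aba"))         # item 3
print(splice(("", "", "", ""), "ab", "c"))               # item 4
```

For item 3 the function returns every admissible insertion, so the
concatenation $aba \cdot bbc$ appears next to the insertion before the second
$b$.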

The language generated by the flat splicing system $S = (A, I, R)$, denoted
$\mathcal{F}(S)$, is the smallest language $L$ containing $I$ and closed by
$R$, i.e., such that for any pair of words $u$ and $v$ in $L$ and any rule
$r$ in $R$, any word $w$ such that $u, v \vdash_r w$ is also in $L$.
Example 2.3 Consider the splicing system over $A = \{a, b\}$ with initial set
$I = \{ab\}$ and the unique splicing rule $r = \langle a|a-b|b\rangle$. It
generates the context-free and non-regular language
$\mathcal{F}(S) = \{a^n b^n \mid n \geq 1\}$.
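
Since a production adds the lengths of its two arguments, the words of
$\mathcal{F}(S)$ up to a given length can be enumerated by a fixpoint
iteration. The sketch below assumes a finite initial set and the tuple
encoding (alpha, gamma, delta, beta) of a rule; the helper names are
illustrative and do not come from the paper.

```python
def splice(rule, u, v):
    """All words w with u, v |-_r w for the rule (alpha, gamma, delta, beta)."""
    alpha, gamma, delta, beta = rule
    if len(v) < len(gamma) + len(delta):
        return set()
    if not (v.startswith(gamma) and v.endswith(delta)):
        return set()
    return {u[:i] + v + u[i:] for i in range(len(u) + 1)
            if u[:i].endswith(alpha) and u[i:].startswith(beta)}

def closure_up_to(initial, rules, max_len):
    """Words of F(S) of length <= max_len, for a finite initial set."""
    lang = {w for w in initial if len(w) <= max_len}
    changed = True
    while changed:
        changed = False
        for u in list(lang):
            for v in list(lang):
                if len(u) + len(v) > max_len:
                    continue  # the result would exceed the bound
                for r in rules:
                    new = splice(r, u, v) - lang
                    if new:
                        lang |= new
                        changed = True
    return lang

# Example 2.3: I = {ab}, R = {<a|a-b|b>}.
example_2_3 = closure_up_to({"ab"}, [("a", "a", "b", "b")], 8)
print(sorted(example_2_3, key=len))
```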

Remark 2.4 A production $u, v \vdash_r w$, where $u = \varepsilon$ or
$v = \varepsilon$, even when it is permitted, is useless. Indeed, one has
$\varepsilon, v \vdash_r v$ and $u, \varepsilon \vdash_r u$. As a
consequence, given a splicing system $S = (A, I, R)$, one has
$\varepsilon \in \mathcal{F}(S)$ if and only if $\varepsilon \in I$. So we
can assume that $\varepsilon \notin I$ without loss of generality. This
remark holds also for circular splicing systems.

Remark 2.5 A production $u, v \vdash_r w$, where $|w| = 1$, even when it is
permitted, is useless. Indeed, since $|u| + |v| = |w|$, one has in this case
$w = u$ or $w = v$. As a consequence, given a splicing system
$S = (A, I, R)$ and a letter $a \in A$, one has $a \in \mathcal{F}(S)$ if and
only if $a \in I$. However, we cannot assume that $a \notin I$ without
possibly changing the language the system generates. This remark holds also
for circular splicing systems.



Remark 2.6 Flat splicing is different from linear splicing as it is defined in
[14].



Remark 2.7 Let $S = (A, I, R)$ be a flat splicing system and let
$S' = (A, {}^{\sim}I, R)$ be the circular splicing system with the same
splicing rules. The full linearization of $\mathcal{C}(S')$ is the closure of
the linear language $I$ under the composition of the two operations of
circularization and splicing. However, it does not suffice, in general, to
just consider a single circularization. Indeed, the equality
$\mathcal{C}(S') = {}^{\sim}\mathcal{F}(S)$ does not hold in general. However,
the inclusion ${}^{\sim}\mathcal{F}(S) \subseteq \mathcal{C}(S')$ is always
true.

Consider the flat splicing system over $A = \{a, b\}$, with initial set
$I = \{ba\}$ and with the single rule $\langle a|a-b|b\rangle$. Clearly, the
rule cannot be applied, and consequently the language generated by the system
reduces to $I$, and its circularization gives ${}^{\sim}I$. The circular
language generated by the system is
${}^{\sim}\{a^n b^n \mid n \geq 1\}$, which is much larger than
${}^{\sim}I$.
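
The gap between ${}^{\sim}\mathcal{F}(S)$ and $\mathcal{C}(S')$ in this
example can be observed on words of bounded length. The sketch assumes the
tuple encoding (alpha, gamma, delta, beta) of a rule and represents a circular
word by its least rotation; every name below is an illustrative choice.

```python
def canon(w):
    """Least rotation of w, used as the representative of the circular word ~w."""
    return min((w[i:] + w[:i] for i in range(len(w))), default="")

def rotations(w):
    return {w[i:] + w[:i] for i in range(len(w))} or {""}

def flat_splice(rule, u, v):
    alpha, gamma, delta, beta = rule
    if len(v) < len(gamma) + len(delta):
        return set()
    if not (v.startswith(gamma) and v.endswith(delta)):
        return set()
    return {u[:i] + v + u[i:] for i in range(len(u) + 1)
            if u[:i].endswith(alpha) and u[i:].startswith(beta)}

def circular_splice(rule, cu, cv):
    alpha, gamma, delta, beta = rule
    hosts = [u for u in rotations(cu)
             if len(u) >= len(alpha) + len(beta)
             and u.startswith(beta) and u.endswith(alpha)]
    guests = [v for v in rotations(cv)
              if len(v) >= len(gamma) + len(delta)
              and v.startswith(gamma) and v.endswith(delta)]
    return {canon(u + v) for u in hosts for v in guests}

def closure(initial, rules, max_len, step):
    """Bounded closure of `initial` under the production function `step`."""
    lang = set(initial)
    changed = True
    while changed:
        changed = False
        for u in list(lang):
            for v in list(lang):
                if len(u) + len(v) > max_len:
                    continue
                for r in rules:
                    new = step(r, u, v) - lang
                    if new:
                        lang |= new
                        changed = True
    return lang

rule = ("a", "a", "b", "b")                      # <a|a-b|b>
flat = closure({"ba"}, [rule], 6, flat_splice)   # F(S) up to length 6
circ = closure({canon("ba")}, [rule], 6, circular_splice)
print(flat, sorted(circ, key=len))
```

The flat closure stays at the single word $ba$, while the circular closure
already contains three circular words up to length 6.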



3. A decision problem

In this section, we prove the following result.

Theorem 3.1 Given a regular circular (resp. flat) splicing system $S$ and a
regular language $K$, it is decidable whether $\mathcal{C}(S) = K$ (resp.
$\mathcal{F}(S) = K$).
Proof We assume that neither $I$ nor $K$ contains $\varepsilon$, since
otherwise it suffices, according to Remark 2.4, to check that $\varepsilon$
is contained in both sets.

Let $S = (A, I, R)$. Let $\mathcal{A} = (A, Q, q_0, Q_F)$ be a deterministic
automaton recognizing $K$, with $Q$ the set of states, $q_0$ the initial state
and $Q_F$ the set of final states. The transition function is denoted by
"$\cdot$" in the following way: for a state $q$ and a word $v$, $q \cdot v$
denotes the state that is reached by $v$ from $q$.

For any state $q \in Q$, we define $G_q = \{v \mid q_0 \cdot v = q\}$ and
$D_q = \{v \mid q \cdot v \in Q_F\}$. The set $G_q$ is the set of all words
which label paths from $q_0$ to $q$, and $D_q$ is the set of all words which
label paths from $q$ to a terminal state. Both sets are regular.

Next, let
$P = \{w \in A^* \mid u, v \vdash_r w,\ r \in R,\ u, v \in K\}$. The set $P$
is the set of the words that can be obtained by splicing two words of $K$.

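For each rule $r = \langle\alpha|\gamma-\delta|\beta\rangle$, let

$$K_r = \{w \in A^* \mid u, v \vdash_r w, \text{ with } u, v \in K\}
      = \{x\alpha \cdot \gamma z \delta \cdot \beta y \mid
          x\alpha \cdot \beta y \in K,\ \gamma z \delta \in K,\
          x, y, z \in A^*\}.$$

It is easily checked that

$$K_r = \bigcup_{q \in Q} (G_q \cap A^*\alpha)(K \cap \gamma A^* \delta)
        (D_q \cap \beta A^*).$$

This expression shows that each language $K_r$ is regular, and so is
$P = \bigcup_{r \in R} K_r$ because $R$ is finite.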

We first consider flat splicing systems. The algorithm consists in checking
three inclusions. We claim that $\mathcal{F}(S) = K$ if and only if the
following three inclusions hold.

(1) $I \subseteq K$,

(2) $P \subseteq K$,

(3) $K \setminus P \subseteq I$.

Take the claim for granted. Then the equality $\mathcal{F}(S) = K$ is
decidable since the three inclusions, which involve only regular languages,
are decidable.

Now we prove the claim, namely that $\mathcal{F}(S) = K$ if and only if the
above-mentioned three inclusions hold.

If $\mathcal{F}(S) = K$, then (1), (2) and (3) are obviously true.

Conversely, assume now that these three inclusions hold. Since
$I \subseteq K$ by (1) and since $K$ is closed under the splicing rules of $R$
by (2), obviously $\mathcal{F}(S) \subseteq K$.

Next, we prove the reverse inclusion $K \subseteq \mathcal{F}(S)$ by induction
on the length of the words in $K$. Let $w \in K$. Since $P \subseteq K$ by
(2), one has $K = P \cup (K \setminus P)$. If $w \in K \setminus P$, then by
(3), $w \in I$ and therefore $w \in \mathcal{F}(S)$. Otherwise, there are
words $u, v \in K$ of shorter length such that $u, v \vdash_r w$ for some
$r \in R$. By induction, $u, v \in \mathcal{F}(S)$ and consequently
$w \in \mathcal{F}(S)$.

For circular splicing systems, it suffices to check, in addition, that $K$ is
closed under conjugacy and to replace $K_r$ by ${}^{\sim}(K_r)$. □
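
The three inclusions can be explored on words of bounded length. The sketch
below is not the decision procedure of the proof, which works on automata; it
is a brute-force sanity check restricted to words of length at most `max_len`,
assuming that $K$ is given as a membership predicate, that the initial set is
finite, and that rules are stored as tuples (alpha, gamma, delta, beta). All
names are hypothetical.

```python
import re
from itertools import product

def splice(rule, u, v):
    alpha, gamma, delta, beta = rule
    if len(v) < len(gamma) + len(delta):
        return set()
    if not (v.startswith(gamma) and v.endswith(delta)):
        return set()
    return {u[:i] + v + u[i:] for i in range(len(u) + 1)
            if u[:i].endswith(alpha) and u[i:].startswith(beta)}

def nonempty_words_up_to(alphabet, n):
    for k in range(1, n + 1):
        for letters in product(alphabet, repeat=k):
            yield "".join(letters)

def three_inclusions_up_to(alphabet, initial, rules, in_K, max_len):
    """Truth values of (1) I <= K, (2) P <= K, (3) K \\ P <= I on short words."""
    K = {w for w in nonempty_words_up_to(alphabet, max_len) if in_K(w)}
    # Lengths add up in a production, so short words of P come from short words of K.
    P = {w for u in K for v in K if len(u) + len(v) <= max_len
         for r in rules for w in splice(r, u, v)}
    I = {w for w in initial if len(w) <= max_len}
    return I <= K, P <= K, (K - P) <= I

# Hypothetical test case: K = a+, one rule allowing every insertion.
in_K = lambda w: re.fullmatch(r"a+", w) is not None
insert_anywhere = [("", "", "", "")]
print(three_inclusions_up_to("a", {"a"}, insert_anywhere, in_K, 6))
print(three_inclusions_up_to("a", {"aa"}, insert_anywhere, in_K, 6))
```

With initial set $\{a\}$ all three inclusions hold on short words, which is
consistent with the system generating $a^+$. With initial set $\{aa\}$ the
third inclusion fails, since the word $a$ is in $K$ but is neither initial nor
a splicing product.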

Remark 3.2 There are two related problems which are still open. The first
is to decide whether the language generated by a splicing system is regular,
and the second is to decide whether a regular language can be generated by
a splicing system. We shall see below that the second problem is decidable
in the case of what we call alphabetic splicing systems.

Remark 3.3 The inclusion problems, for both inclusions, i.e., the problem
of deciding whether $\mathcal{F}(S) \subseteq K$ or whether
$K \subseteq \mathcal{F}(S)$ (resp. $\mathcal{C}(S) \subseteq K$ or
$K \subseteq \mathcal{C}(S)$), are still open.

Remark 3.4 The characterization of the family of regular languages which
can be obtained by a circular splicing system is still open. However, partial
results have been obtained by P. Bonizzoni, C. De Felice, G. Mauri and
R. Zizza [?]. In particular, a complete characterization of languages
over one letter generated by a splicing system is given in [?]. Recently, a
description of the languages generated by a family of alphabetic splicing
systems called semi-simple systems has been given in [?].

Remark 3.5 Given a regular language $K$ over an alphabet $A$, it is decidable
whether it can be generated by a finite alphabetic splicing system. (The
problem is meaningless for regular systems.) Indeed, observe first that there
are only finitely many alphabetic splicing rules over $A$. So there are only
finitely many sets of alphabetic splicing rules over $A$. Choose one such set
and call it $R$. Define $P$ as in the proof of Theorem 3.1. If
$P \not\subseteq K$ or $K \setminus P$ is infinite, then the test is negative
for this set $R$. Otherwise, the splicing system
$S = (A, K \setminus P, R)$ generates $K$.

4. Splicing languages are context-sensitive

We will see that the highest level in the Chomsky hierarchy which can be
obtained by splicing systems with a finite initial set and a finite set of
rules is the context-sensitive level. This result remains true when the
initial set is context-sensitive.

Before proving this property, we give an example of a splicing language
which is not context-free.

A splicing language which is not context-free

We first consider flat splicing.

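Let $A$ be the alphabet $\{0, 1, 2, 3, \triangleright, \triangleleft\}$ and
set $u = 0123$. Let $S = (A, I, R)$ be the flat splicing system with

$$I = \{\triangleright u \triangleleft,\ 0,\ 1,\ 2,\ 3\}$$

and with $R$ composed of the rules

$$\langle \triangleright | 0-\varepsilon | u \rangle, \quad
  \langle 0u | 0-\varepsilon | u \rangle,$$
$$\langle 0 | 1-\varepsilon | u\triangleleft \rangle, \quad
  \langle 0 | 1-\varepsilon | u01u \rangle,$$
$$\langle \triangleright 01 | 2-\varepsilon | u \rangle, \quad
  \langle 012u01 | 2-\varepsilon | u \rangle,$$
$$\langle 012 | 3-\varepsilon | u\triangleleft \rangle, \quad
  \langle 012 | 3-\varepsilon | uuu \rangle.$$

This splicing system produces the language

$$\mathcal{F}(S) = \{\triangleright u^{2^n} \triangleleft \mid n \geq 0\}$$
$$\cup\ \{\triangleright (0u)^p u^q \triangleleft \mid p + q = 2^n,\ n \geq 0\}$$
$$\cup\ \{\triangleright (0u)^p (01u)^q \triangleleft \mid p + q = 2^n,\ n \geq 0\}$$
$$\cup\ \{\triangleright (012u)^p (01u)^q \triangleleft \mid p + q = 2^n,\ n \geq 0\}$$
$$\cup\ \{\triangleright (012u)^p (uu)^q \triangleleft \mid p + q = 2^n,\ n \geq 0\},$$

together with the four letters $0, 1, 2, 3$ of the initial set.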

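Indeed, given a word $\triangleright u^n \triangleleft$, the first two rules
of $R$ generate a left-to-right sweep inserting the symbol 0 at the head of
each $u$:

$$\triangleright u^n \triangleleft \to
  \triangleright (0u) u^{n-1} \triangleleft \to \cdots \to
  \triangleright (0u)^{n-1} u \triangleleft \to
  \triangleright (0u)^n \triangleleft.$$

(We write here $x \to y$ when $y$ is obtained from $x$ by one such
insertion.) The next two rules generate a right-to-left sweep which inserts a
symbol 1 at the head of each $u$. This gives

$$\triangleright (0u)^n \triangleleft \to
  \triangleright (0u)^{n-1} (01u) \triangleleft \to \cdots \to
  \triangleright (0u) (01u)^{n-1} \triangleleft \to
  \triangleright (01u)^n \triangleleft.$$

The next two rules are used to insert a symbol 2 at the head of each $u$,
again in a left-to-right sweep. This gives the word
$\triangleright (012u)^n \triangleleft$. Finally, the last two rules insert a
3 at the head of each $u$. The final result is
$\triangleright u^{2n} \triangleleft$.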
The intersection of the language $\mathcal{F}(S)$ with the regular language
$\triangleright u^* \triangleleft$ is equal to
$\{\triangleright u^{2^n} \triangleleft \mid n \geq 0\}$. The latter language
is not context-free.
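
The doubling behaviour can be checked experimentally on short words. The
sketch below assumes the eight rules as reconstructed above, writes `>` and
`<` for the two end markers, stores a rule
$\langle\alpha|\gamma-\delta|\beta\rangle$ as the tuple
(alpha, gamma, delta, beta), and cuts the closure at length 34, which is the
length of the word with eight copies of $u$. Function names are illustrative.

```python
import re

def splice(rule, u, v):
    alpha, gamma, delta, beta = rule
    if len(v) < len(gamma) + len(delta):
        return set()
    if not (v.startswith(gamma) and v.endswith(delta)):
        return set()
    return {u[:i] + v + u[i:] for i in range(len(u) + 1)
            if u[:i].endswith(alpha) and u[i:].startswith(beta)}

def closure_up_to(initial, rules, max_len):
    """Words of F(S) of length <= max_len, for a finite initial set."""
    lang = {w for w in initial if len(w) <= max_len}
    changed = True
    while changed:
        changed = False
        for u in list(lang):
            for v in list(lang):
                if len(u) + len(v) > max_len:
                    continue
                for r in rules:
                    new = splice(r, u, v) - lang
                    if new:
                        lang |= new
                        changed = True
    return lang

U = "0123"
INITIAL = {">" + U + "<", "0", "1", "2", "3"}
RULES = [
    (">", "0", "", U),               ("0" + U, "0", "", U),
    ("0", "1", "", U + "<"),         ("0", "1", "", U + "01" + U),
    (">01", "2", "", U),             ("012" + U + "01", "2", "", U),
    ("012", "3", "", U + "<"),       ("012", "3", "", U + U + U),
]

lang = closure_up_to(INITIAL, RULES, 34)
# Number of copies of u in the generated words of the form > u^k <.
exponents = sorted(len(w[1:-1]) // 4 for w in lang
                   if re.fullmatch(r">(0123)*<", w))
print(exponents)
```

Only powers of two show up among the exponents within the bound, in line with
the description of the four sweeps.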

Concerning circular splicing systems, recall that a circular language is
context-sensitive if and only if its full linearization is context-sensitive.
If we take the circular splicing system with the same rules and the same
initial language, we can check that the language $\mathcal{C}(S)$ is such
that $\mathcal{C}(S) \cap \triangleright A^* = \mathcal{F}(S)$. Thus, we can
also produce a language which is not context-free with a circular splicing
system.

Splicing languages are always context-sensitive

Theorem 4.1 The language generated by a context-sensitive circular (resp.
flat) splicing system is context-sensitive.

The proof uses bounded automata. Recall that a $k$-linear bounded automaton
($k$-LBA) is a non-deterministic Turing machine with a tape of only $kn$
cells, where $n$ is the size of the input. We will use in the sequel the
following characterization of context-sensitive languages. A language is
context-sensitive if and only if it is recognized by a $k$-LBA (see, for
example, [?]). It is known that it is always possible to recognize a
context-sensitive language with a 1-LBA.

Proof We start with the case of a flat system. Let $S = (A, I, R)$ be a
flat splicing system. Let $\mathcal{T}$ be a 1-LBA recognizing $I$. We
construct a 3-LBA $\mathcal{U}$ which recognizes the language
$\mathcal{F}(S)$.

Let $u$ be the word written on the tape at the beginning of the computation.
Let $\#$ be a new symbol. The machine works as follows.

During the computation the word written on the tape has the form

$$u_1 \# u_2 \# \cdots \# u_{n-1} \# u_n,$$

where the $u_i$ are words on the alphabet $A$.



Repeat the following operation as long as possible.

(1) If the tape is empty, stop and return "yes".

(2) If $u_n$ is in the set $I$ (this test is performed by machine
    $\mathcal{T}$), remove $u_n$ along with the symbol $\#$ which may precede
    $u_n$.

(3) Choose nondeterministically a rule
    $r = \langle\alpha|\gamma-\delta|\beta\rangle$ in $R$, and choose
    nondeterministically, if it exists, a decomposition of $u_n$ of the form
    $u_n = x\alpha\gamma y\delta\beta z$ such that neither $x\alpha\beta z$
    nor $\gamma y \delta$ is the empty word. Remove the subword
    $\gamma y \delta$ from $u_n$ and place it at the right after a $\#$
    symbol. Then shift the string $\beta z \# \gamma y \delta$ so that we
    have on the tape
    $u_1 \# u_2 \# \cdots \# u_{n-1} \# x\alpha\beta z \# \gamma y \delta$.
    If no choice exists, stop the computation.
It can be easily seen that the length of the tape is always less than
$3|u|$. If no computation succeeds, then the word is rejected.

In the case of a circular splicing system, the method is almost the same.
The only difference is that, in the last step, one chooses in addition
nondeterministically one of the conjugates of $u_n$. □
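
The idea of undoing productions can be illustrated by a deterministic search.
The sketch below is not a linear bounded automaton: it replaces the
nondeterministic choices of step (3) by exhaustive backtracking with
memoization, and it assumes a finite initial set given as a Python set and
rules stored as tuples (alpha, gamma, delta, beta). The function names are
hypothetical.

```python
from functools import lru_cache

def make_membership(initial, rules):
    """Return a function deciding w in F(S) by undoing productions."""
    initial = frozenset(initial)
    rules = tuple(rules)

    @lru_cache(maxsize=None)
    def member(w):
        if w in initial:
            return True
        for alpha, gamma, delta, beta in rules:
            for i in range(len(w) + 1):
                if not w[:i].endswith(alpha):
                    continue
                # The inserted word v = w[i:j] is non-empty and factors
                # as gamma y delta.
                shortest = max(1, len(gamma) + len(delta))
                for j in range(i + shortest, len(w) + 1):
                    v = w[i:j]
                    u = w[:i] + w[j:]       # the word x alpha beta z
                    if not u or not w[j:].startswith(beta):
                        continue
                    if not (v.startswith(gamma) and v.endswith(delta)):
                        continue
                    # Both parts are strictly shorter than w, so this terminates.
                    if member(u) and member(v):
                        return True
        return False

    return member

# The system of Example 2.3: I = {ab}, R = {<a|a-b|b>}.
member = make_membership({"ab"}, [("a", "a", "b", "b")])
print(member("aaabbb"), member("aabbb"), member("abab"))
```

Requiring both parts of a decomposition to be non-empty, as in step (3),
is what guarantees termination of the recursion.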



5. Alphabetic splicing systems

A rule in a splicing system is called alphabetic if its handles have length at
most one. A splicing system is called alphabetic if all its rules are
alphabetic.

The splicing systems of Examples 2.1 and 2.3 are alphabetic. They generate
a non-regular language, although they have a finite initial set. Let us
give another example.

Example 5.1 Let $S = (A, I, R)$ be the flat splicing system defined by
$A = \{a, \bar{a}\}$, $I = \{a\bar{a}\}$ and
$R = \{\langle\varepsilon|\varepsilon-\varepsilon|\varepsilon\rangle\}$. It
generates the Dyck language. Recall that the Dyck language over
$\{a, \bar{a}\}$ is the language of parenthesized expressions, $a, \bar{a}$
being viewed as a pair of matching parentheses.

The circular splicing system $S = (A, I, R)$ defined by
$A = \{a, \bar{a}\}$, $I = \{{}^{\sim}(a\bar{a})\}$ and
$R = \{\langle\varepsilon|\varepsilon-\varepsilon|\varepsilon\rangle\}$
generates the language $D$ of words having as many $a$ as $\bar{a}$. The
language $D$ is the circularization of the Dyck language.

Remark 5.2 All examples given so far suggest that alphabetic splicing
systems always generate context-free languages, and this is indeed the main
result of the paper. Observe however that we cannot get all context-free
languages as splicing languages with a finite initial set. For example, the
language $L = \{a^n b^n c \mid n \geq 1\}$ cannot be obtained by such a
splicing system. (Consider indeed the fact that all words in $L$ have the
same number of $c$.)

5.1. Main theorem

We now state the main theorem, namely that alphabetic rules and a
context-free initial set can produce only context-free languages.

Theorem 5.3 (i) The language generated by a circular alphabetic context-free
splicing system is context-free.

(ii) The language generated by a flat alphabetic context-free splicing system
is context-free.

This theorem is effective, that is, we can actually construct a context-free
grammar which generates the language produced by the splicing system.

The rest of the paper is devoted to the proof of this theorem.
Section 6 introduces pure splicing systems. Here, it is proved that the
language generated by a context-free pure splicing system is context-free.
The proof uses some results on context-free languages which are recalled in
Section 6.1.

In the next section (Section 7), we first define concatenation systems and
prove that (alphabetic) concatenation systems produce only context-free
languages. Then heterogeneous systems are defined, and the weak commutation
lemma (Lemma 7.8) is proved. This section ends with the proof of the main
theorem for flat splicing systems.

Section 8 describes the relationship between flat and circular splicing
systems and their languages. It contains the proof of the main theorem for
circular splicing systems.

The proofs that concatenation systems and alphabetic pure systems generate
context-free languages are done by giving the grammars explicitly.
These grammars deviate from the standard form of grammars by the fact
that the sets of derivation rules may be infinite, provided they are themselves
context-free sets. It is an old result from formal language theory that
generalized context-free grammars of this kind still generate context-free
languages. For the sake of completeness, we include a sketch of the proof of
this result, together with an example, in an appendix.

We start with a technical normalization of splicing systems. 

5.2. Complete set of rules

Completion of rules is a tool to manage the usage of the empty word
$\varepsilon$ among the handles $\alpha, \beta, \gamma, \delta$ of an
alphabetic rule

$$r = \langle\alpha|\gamma-\delta|\beta\rangle$$

in a production

$$u, v \vdash_r w. \qquad (5.1)$$

Assume first that $\delta = \varepsilon$. (The case where
$\gamma = \varepsilon$ is symmetric.) In this case, the production (5.1) is
valid provided $v$ starts with $\gamma$ (and of course if $u$ has an
appropriate factorization $u = x\alpha\beta y$). Let $d$ be the final letter
of $v$. Then the same result is obtained with the rule

$$r_d = \langle\alpha|\gamma-d|\beta\rangle,$$

with only one, but noticeable, exception: this is the case where $v$ is a
single letter, that is $v = \gamma$. Observe that this may happen only if $v$
is in the initial set of the system.
In other words, a rule

$$r = \langle\alpha|\gamma-\varepsilon|\beta\rangle$$

is mandatory if and only if $\gamma \in I$. For all words $v \neq \gamma$,
the production (5.1) is realized by the use of the rule $r_d$, where $d$ is
the final letter of $v$. Thus a rule with $\delta = \varepsilon$ can be
replaced by the set of rules $r_d$, for $d \in A$, with one exception.

Assume next that $\beta = \varepsilon$. (The case where
$\alpha = \varepsilon$ is symmetric.) In this case, the production

$$u, v \vdash_r w$$

is valid provided $\alpha$ occurs in $u$ (and $v$ begins with $\gamma$ and
ends with $\delta$). This holds in particular when $\alpha$ is the final
letter of $u$. In this case, one gets

$$w = uv.$$

In other words, the application of the rule reduces to a simple
concatenation. If however $u$ has another occurrence of $\alpha$, that is if
$u = x\alpha y$ for some $y \neq \varepsilon$, then the rule $r$ can be
replaced by the appropriate rule
$\langle\alpha|\gamma-\delta|d\rangle$, where $d$ is the initial letter of
$y$.

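In conclusion, the use of a rule

$$r = \langle\alpha|\gamma-\varepsilon|\beta\rangle \quad
  (\text{resp. } r = \langle\alpha|\varepsilon-\delta|\beta\rangle)$$

can always be replaced by the use of a rule

$$r = \langle\alpha|\gamma-d|\beta\rangle \quad
  (\text{resp. } r = \langle\alpha|c-\delta|\beta\rangle)$$

for letters $c, d \in A$, except, and this is the only case, when the word to
be inserted is a single letter which is in the initial set.

On the contrary, the use of a rule

$$r = \langle\alpha|\gamma-\delta|\varepsilon\rangle \quad
  (\text{resp. } r = \langle\varepsilon|\gamma-\delta|\beta\rangle)$$

can be replaced by the use of a rule

$$r = \langle\alpha|\gamma-\delta|b\rangle \quad
  (\text{resp. } r = \langle a|\gamma-\delta|\beta\rangle)$$

for letters $a, b \in A$, except when the result is a concatenation $w = uv$
(resp. $w = vu$).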
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400 Example 5.4 Let S = (A,I,R) with A = {a,b, c}, I = {abc,abb} and R 

401 composed of the single rule r = (b\a— b\e). The rule r permits the production 

402 ab ■ c, abb h r ab ■ abb ■ c. This production could also be realized with the rule 

403 r' = (b\a—b\c) obtained from r by replacing e by c. Similarly, the production 

404 ab ■ b, abb h r ab ■ abb ■ b could also be realized with the rule r" = (b\a—b\b). 

405 Conversely, all productions that can be realized with r' and r" can be made 

406 with r. 

407 We can thus check that the system S' = (A, I, R') with the set of rules 

408 R' = {(b\a— b\e), (b\a—b\a), (b\a— b\b), (b\a—b\c)} produces the same language 

409 as the system S does. 

410 However, the production abb-, abb h r abb ■ abb, cannot be obtained by 

411 use of a production without e-handle. So, the system S" = {A, I, R") with 

412 R" = {(b\a— b\a), (b\a—b\b), (b\a—b\c)} does not produce the same language 

413 as the system S does. 
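The comparison made in Example 5.4 can be checked mechanically on short words. The following Python sketch is only an illustration under stated assumptions: the helper names `splice` and `closure` and the length bound `max_len` are introduced here and are not part of the paper. A rule $\langle \alpha|\gamma-\delta|\beta \rangle$ is encoded as the tuple `(alpha, gamma, delta, beta)`, and an empty handle is read, as in the discussion above, as imposing no constraint.

```python
from itertools import product


def splice(u, v, rule):
    """Words obtained by inserting v into u with the flat rule <alpha|gamma-delta|beta>."""
    alpha, gamma, delta, beta = rule
    # v must belong to gamma A* delta
    if len(v) < len(gamma) + len(delta):
        return set()
    if not (v.startswith(gamma) and v.endswith(delta)):
        return set()
    out = set()
    for i in range(len(u) + 1):  # cut u between u[:i] and u[i:]
        if u[:i].endswith(alpha) and u[i:].startswith(beta):
            out.add(u[:i] + v + u[i:])
    return out


def closure(initial, rules, max_len):
    """Words of length at most max_len generated from `initial` (hypothetical bound)."""
    lang = set(initial)
    changed = True
    while changed:
        changed = False
        for u, v in product(list(lang), repeat=2):
            if len(u) + len(v) > max_len:
                continue
            for rule in rules:
                new = splice(u, v, rule) - lang
                if new:
                    lang |= new
                    changed = True
    return lang


I = {"abc", "abb"}
r = ("b", "a", "b", "")                       # <b|a-b|eps>
R2 = [("b", "a", "b", d) for d in "abc"]      # the rules of R''
S, S1, S2 = closure(I, [r], 9), closure(I, [r] + R2, 9), closure(I, R2, 9)
print(S == S1, "abbabb" in S, "abbabb" in S2)
```

Up to the chosen bound, the first two systems agree, while the word $abbabb$ is missing from the third one, as stated in the example. A bounded check of this kind is of course no proof of equality of the full languages.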

We say that a splicing system $\mathcal{S} = (A,I,R)$ is complete if for any rule
$r = \langle \alpha_1|\alpha_3-\alpha_4|\alpha_2 \rangle$ in $R$, whenever one or
several of the $\alpha_i$ are equal to the empty word, then the set $R$ contains
all rules obtained by replacing some or all of the empty handles by all letters
of the alphabet.

For example, the system $\mathcal{S}' = (A,I,R')$ of Example 5.4 is complete. The
completion of a splicing system consists in adding to the system the rules that
make it complete. Completion is possible for alphabetic splicing systems without
changing the language they produce.

Lemma 5.5 For any alphabetic splicing system $\mathcal{S} = (A,I,R)$, the complete
alphabetic splicing system obtained by completing the set of rules generates
the same language.

The proof is left to the reader.

Observe that completion may considerably increase the number of rules of a
splicing system. Thus, over a $k$-letter alphabet, completing a rule with one
$\varepsilon$-handle adds $k$ rules, and if the rule has two $\varepsilon$-handles,
completion adds $k^2 + 2k$ rules.
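The counts $k$ and $k^2 + 2k$ can be reproduced by a direct enumeration. The sketch below is an illustration only: the function name `complete` and the encoding of a rule $\langle \alpha|\gamma-\delta|\beta \rangle$ as a tuple `(alpha, gamma, delta, beta)` of strings (with `""` for the empty word) are assumptions made here.

```python
from itertools import product


def complete(rules, alphabet):
    """Add every rule obtained by replacing some or all empty handles by letters."""
    completed = set(rules)
    for rule in rules:
        # an empty handle may stay empty or become any letter; other handles are fixed
        choices = [[h] if h else [""] + list(alphabet) for h in rule]
        completed.update(product(*choices))
    return completed


A = "abc"                                              # k = 3
one_handle = complete({("b", "a", "b", "")}, A)        # one empty handle
two_handles = complete({("c", "", "b", "")}, A)        # two empty handles
print(len(one_handle) - 1, len(two_handles) - 1)       # rules added: k and k^2 + 2k
```

With $k = 3$, the first call adds $3$ rules and the second adds $15 = 3^2 + 2 \cdot 3$ rules, since each empty handle independently takes $k + 1$ values and the original rule is counted once.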

430 In the proof of Theorem i.e., in Sections |7[ |8] we will assume that 

431 splicing systems are complete. 

Remark 5.6 Complete systems may simplify some verifications. Thus, in order to
verify that one may insert a letter $a$ between some letters $d$ and $b$, it
suffices to check that one of $\langle d|a-\varepsilon|b \rangle$ or
$\langle d|\varepsilon-a|b \rangle$ is in the set of splicing rules. Otherwise we
would also have to check whether one of $\langle \varepsilon|a-\varepsilon|b \rangle$
or $\langle \varepsilon|a-\varepsilon|\varepsilon \rangle$ or
$\langle d|\varepsilon-\varepsilon|b \rangle$, ... is in the set of splicing rules.

6. Pure splicing systems

In this section, we consider a subclass of splicing systems called pure systems,
and we prove (Theorem 6.3) that these systems generate context-free languages.
We start with a description of two theorems for context-free languages that
will be useful.

6.1. Two theorems on context-free languages

We recall here, for the convenience of the reader, the notions of context-free
substitution and generalized context-free grammar, along with two substitution
theorems. A sketch of proof of the second theorem and an example are given in
the appendix.

Let $A$ and $B$ be two alphabets. A substitution from $A^*$ to $B^*$ is a mapping
$\sigma$ from $A^*$ into subsets of $B^*$ such that $\sigma(\varepsilon) = \{\varepsilon\}$ and

$$\sigma(xy) = \sigma(x)\sigma(y)$$

for all $x, y \in A^*$. The product on the right-hand side is the product of
subsets of $B^*$. The substitution is called finite (resp. regular, context-free,
context-sensitive) if all the languages $\sigma(a)$, for $a$ a letter of $A$, are
finite (resp. regular, context-free, context-sensitive).
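For a finite substitution, the defining identity can be turned directly into a computation. The following sketch is illustrative only: the name `substitute` and the representation of a substitution as a dictionary from letters to finite sets of words are assumptions made here, and regular or context-free substitutions (whose images may be infinite) are outside its scope.

```python
def substitute(sigma, word):
    """Image of `word` under the finite substitution given letter by letter by `sigma`."""
    images = {""}                       # the image of the empty word is {empty word}
    for letter in word:
        # product of subsets: sigma(x a) = sigma(x) sigma(a)
        images = {x + y for x in images for y in sigma[letter]}
    return images


sigma = {"a": {"x", "yy"}, "b": {"z"}}
print(sorted(substitute(sigma, "ab")))
```

Because the image of a word is built as a product of the images of its letters, the identity $\sigma(xy) = \sigma(x)\sigma(y)$ holds by construction.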

The usual substitution theorem for context-free languages (see, for example,
||) is the following.

Theorem 6.1 Let $L$ be a context-free language over an alphabet $A$ and let
$\sigma$ be a context-free substitution. Then the language $\sigma(L)$ is
context-free.

A more general theorem, which is also a kind of substitution theorem, is due to
J. Král [13]. In order to state it, we introduce the following definition. A
generalized grammar $G$ is a quadruplet $(A, V, S, R)$, where $A$ is a terminal
alphabet, $V$ is a non-terminal alphabet, and $S \in V$ is the axiom. The set of
rules $R$ is a possibly infinite subset of $V \times (A \cup V)^*$. For each
$v \in V$, define $M_v = \{m \mid v \to m \in R\}$. In a usual context-free grammar,
the sets $M_v$ are finite. The grammar $G$ is said to be a generalized
context-free grammar if the languages $M_v$ are all context-free.

Derivations are defined as usual. More precisely, given $v, w \in (A \cup V)^*$, we
denote by $v \to w$ the fact that $v$ directly derives $w$ and by
$v \xrightarrow{*} w$ the fact that $v$ derives $w$. The language generated by $G$,
denoted $L_G$, is the set of words over $A$ derived from $S$, i.e.,
$L_G = \{u \in A^* \mid S \xrightarrow{*} u\}$.

It will be convenient, in the sequel, to use the notation
$v \to \sum_{m \in M_v} m$ or $v \to M_v$ as shortcuts for the set of rules
$\{v \to m \mid m \in M_v\}$.

Thus, the only difference between usual and generalized context-free grammars
is that for the latter the set of productions may be infinite, and in this case
it is itself context-free.

Theorem 6.2 [13] The language generated by a generalized context-free grammar
is context-free.

A sketch of the proof of this theorem is given in the appendix.

6.2. Pure alphabetic splicing systems

A splicing rule $r = \langle \alpha|\gamma-\delta|\beta \rangle$ is pure if both
$\alpha$ and $\beta$ are nonempty. If the rule is alphabetic, this means that
$\alpha$ and $\beta$ are letters. A splicing system is pure if all its rules are
pure.

Theorem 6.3 The language generated by an alphabetic context-free pure splicing
system is context-free.

Proof Let $\mathcal{S} = (A,I,R)$ be an alphabetic context-free pure system. We
suppose that the set $R$ is complete.

We construct a generalized context-free grammar $G$ with axiom $S$, terminal
alphabet $A$ and with non-terminals $S$ and ${}^aB^b$, ${}^aW^b$, for $a, b \in A$,
and $V^a$ for $a \in A$.

The variable ${}^aW^b$ is used to derive words with at least two letters that
begin with a letter $a$ and end with a letter $b$. The variable $V^a$ is used to
derive the word $a$ if it is in the set $I$.

A symbol ${}^aB^b$ is always preceded by a letter $a$ or by a variable $V^a$ or by
a variable ${}^cW^a$, and is always followed by a letter $b$ or by a variable
$V^b$ or by a variable ${}^bW^d$. Roughly speaking, the symbol ${}^aB^b$ denotes
words for which eventually there is a letter $a$ preceding it and a letter $b$
following it.

We define an operation

$$\mathrm{Ins} : A^+ \to \Bigl(A \cup \bigcup_{a,b \in A} {}^aB^b\Bigr)^+$$

by $\mathrm{Ins}(x) = x$ for $x \in A$, and on words $a_1 a_2 a_3 \cdots a_{n-1} a_n$,
where $a_1, \ldots, a_n \in A$ and $n \geq 2$, by setting

$$\mathrm{Ins}(a_1 a_2 a_3 \cdots a_{n-1} a_n) = a_1\, {}^{a_1}B^{a_2}\, a_2\, {}^{a_2}B^{a_3}\, a_3 \cdots a_{n-1}\, {}^{a_{n-1}}B^{a_n}\, a_n .$$
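The operation Ins is easy to state in code. The sketch below is only an illustration of the definition: the function names `ins` and `erase` and the encoding of the variable ${}^aB^b$ as the tuple `("B", a, b)` are assumptions made here.

```python
def ins(word):
    """Interleave the letters of a nonempty word with the symbols aBb."""
    if not word:
        raise ValueError("Ins is defined on nonempty words only")
    out = [word[0]]
    for left, right in zip(word, word[1:]):
        out.append(("B", left, right))   # stands for the variable ^left B^right
        out.append(right)
    return out


def erase(symbols):
    """Drop every B symbol, keeping only the letters."""
    return "".join(s for s in symbols if isinstance(s, str))


print(ins("abc"))
print(erase(ins("abca")))
```

A one-letter word is left unchanged, a word of length $n \geq 2$ receives $n - 1$ symbols of type $B$, and erasing these symbols gives back the original word.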

The derivation rules of $G$ are divided into the three following groups. (Here
$a$ and $b$ are letters in $A$.)

The first group contains derivation rules that separate words according to
their initial and final letters, and single out one-letter words.

    $S \to {}^aW^b$,
    $S \to V^a$,
    ${}^aW^b \to \mathrm{Ins}(I \cap aA^*b)$,
    $V^a \to I \cap a$.

We use here the convention that a derivation rule of the last type is not added
if $I \cap a$ is empty. Similarly, the third sets in these derivation rules may
be empty. Observe that these sets may also be context-free.

The second group reflects the application of the rules in $R$. It is composed of

    ${}^aB^b \to {}^aB^c\, {}^cW^d\, {}^dB^b$  for $\langle a|c-d|b \rangle \in R$;
    ${}^aB^b \to {}^aB^c\, V^c\, {}^cB^b$  for $\langle a|c-\varepsilon|b \rangle \in R$ or $\langle a|\varepsilon-c|b \rangle \in R$.

The third group of derivation rules is used to replace the variables ${}^aB^b$ by
the empty word.

    ${}^aB^b \to \varepsilon$.

By Theorem 6.2, the language generated by $G$ is context-free.

We claim that $L_G = \mathcal{F}(\mathcal{S})$. Consider a derivation

$$S \xrightarrow{*} w, \quad \text{with } w \in A^+$$

in the grammar $G$. Suppose now that, in this derivation, we remove all
derivation steps involving a derivation rule of the third group. Then the
derivation is still valid, and the result is a derivation

$$S \xrightarrow{*} \mathrm{Ins}(w) .$$

Conversely, given a derivation $S \xrightarrow{*} \mathrm{Ins}(w)$, one gets a
derivation $S \xrightarrow{*} w$ by simply applying the necessary derivation rules
of the third group.

We denote by $L'_G$ the language obtained without applying the productions of
the third type, and by $L'_G({}^aW^b)$ and $L'_G(V^a)$ the languages obtained when
starting with the variable ${}^aW^b$ (resp. with $V^a$), and we prove that
$L'_G = \mathrm{Ins}(\mathcal{F}(\mathcal{S}))$.

First, we prove the inclusion $L'_G \subseteq \mathrm{Ins}(\mathcal{F}(\mathcal{S}))$.
For this, we prove by induction on the length of the derivations in $G$ that for
all letters $a, b \in A$, we have
$L'_G({}^aW^b) \subseteq \mathrm{Ins}(\mathcal{F}(\mathcal{S}) \cap aA^*b)$ and
$L'_G(V^a) \subseteq \mathrm{Ins}(\mathcal{F}(\mathcal{S}) \cap a)$.

A derivation $X \xrightarrow{*} v$ is called terminal if $v$ does not contain any
occurrence of variables other than ${}^aB^b$, for $a, b \in A$. It is easy to check
that the lengths of terminal derivations are always odd.

The only terminal derivations of length one are

    ${}^aW^b \to \mathrm{Ins}(I \cap aA^*b)$,
    $V^a \to I \cap a$,

and the inclusion is clear.

Assume that the hypotheses of induction hold for derivations of length less
than $k$ and let $u$ be a word obtained by a derivation of length $k$. Since the
length of the derivation is greater than 1, the derivation starts with a
derivation step $S \to {}^aW^b$ for some $a, b \in A$. The last two derivation
steps have one of the following forms

    ${}^cB^d \to {}^cB^e\, {}^eW^f\, {}^fB^d \to {}^cB^e\, x\, {}^fB^d$,  with $x \in \mathrm{Ins}(I \cap eA^*f)$,
    ${}^cB^d \to {}^cB^e\, V^e\, {}^eB^d \to {}^cB^e\, e\, {}^eB^d$,  with $e \in I$,

for suitable letters $c, d, e, f$.

In the first case, there are words $v, w$ such that

$${}^aW^b \xrightarrow{*} v\, {}^cB^d\, w \to v\, {}^cB^e\, {}^eW^f\, {}^fB^d\, w \to v\, {}^cB^e\, x\, {}^fB^d\, w = u .$$

By induction, $v\, {}^cB^d\, w \in \mathrm{Ins}(\mathcal{F}(\mathcal{S}) \cap aA^*b)$.
Since the derivation rule ${}^cB^d \to {}^cB^e\, {}^eW^f\, {}^fB^d$ is in $G$, there
is a splicing rule $\langle c|e-f|d \rangle$ in $R$. Consequently, the word $u$ is
in $\mathrm{Ins}(\mathcal{F}(\mathcal{S}) \cap aA^*b)$. The second case is similar.
This proves the inclusion $L'_G \subseteq \mathrm{Ins}(\mathcal{F}(\mathcal{S}))$.

Now, we prove the inclusion $\mathrm{Ins}(\mathcal{F}(\mathcal{S})) \subseteq L'_G$.
For this, we prove that for all letters $a, b \in A$, we have
$\mathrm{Ins}(\mathcal{F}(\mathcal{S}) \cap aA^*b) \subseteq L'_G({}^aW^b)$ and
$\mathrm{Ins}(\mathcal{F}(\mathcal{S}) \cap a) \subseteq L'_G(V^a)$.

We observe that for a letter $a$, one has
$\mathcal{F}(\mathcal{S}) \cap a = I \cap a = \mathrm{Ins}(\mathcal{F}(\mathcal{S}) \cap a)$.
The letter $a$ is thus obtained by the derivation $V^a \to I \cap a$. Thus we have
$\mathrm{Ins}(\mathcal{F}(\mathcal{S}) \cap a) \subseteq L'_G(V^a)$ for all letters
$a \in A$.

Let us prove the inclusions
$\mathrm{Ins}(\mathcal{F}(\mathcal{S}) \cap aA^*b) \subseteq L'_G({}^aW^b)$ by
induction on the number of splicing rules used for the production of a word in
$\mathcal{F}(\mathcal{S}) \cap aA^*b$. Let $u \in \mathcal{F}(\mathcal{S}) \cap aA^*b$.
If no splicing rule is used, then $u \in I \cap aA^*b$. The word $\mathrm{Ins}(u)$ is
obtained by the application of the corresponding derivation rule
${}^aW^b \to \mathrm{Ins}(u)$, which is in the set
${}^aW^b \to \mathrm{Ins}(I \cap aA^*b)$. Thus $\mathrm{Ins}(u) \in L'_G({}^aW^b)$.

Assume that the inductive hypothesis holds for the words obtained by fewer than
$k$ splicing operations, and that $u$ is obtained by application of $k \geq 1$
splicing operations. We consider the last insertion that leads to $u$: there
exist three nonempty words $v$, $w$ and $x$ and a pure rule $r \in R$, such that
$v \cdot w, x \vdash_r u = v \cdot x \cdot w$, and moreover $vw$ and $x$ are words
of $\mathcal{F}(\mathcal{S})$ obtained by fewer than $k$ splicing operations.

Two cases may occur, for suitable letters $e$ and $f$:

    $x \in \mathcal{F}(\mathcal{S}) \cap eA^*f$,
    $x \in \mathcal{F}(\mathcal{S}) \cap e$.

Consider the first case. Let $c$ be the last letter of $v$, and let $d$ be the
first letter of $w$. Then $r = \langle c|e-f|d \rangle$. By induction hypothesis,
we have ${}^aW^b \xrightarrow{*} \mathrm{Ins}(v)\, {}^cB^d\, \mathrm{Ins}(w)$
$(= \mathrm{Ins}(vw))$ and ${}^eW^f \xrightarrow{*} \mathrm{Ins}(x)$. Moreover, the
rule $r$ shows that the derivation rule
${}^cB^d \to {}^cB^e\, {}^eW^f\, {}^fB^d$ is in the grammar $G$. Thus combining
these three derivations, we obtain

$${}^aW^b \xrightarrow{*} \mathrm{Ins}(v)\, {}^cB^d\, \mathrm{Ins}(w) \to \mathrm{Ins}(v)\, {}^cB^e\, {}^eW^f\, {}^fB^d\, \mathrm{Ins}(w) \xrightarrow{*} \mathrm{Ins}(v)\, {}^cB^e\, \mathrm{Ins}(x)\, {}^fB^d\, \mathrm{Ins}(w) .$$

Thus $\mathrm{Ins}(u) \in L'_G({}^aW^b)$. The second case is similar.

This shows the inclusion $\mathrm{Ins}(\mathcal{F}(\mathcal{S})) \subseteq L'_G$.
Consequently $\mathrm{Ins}(\mathcal{F}(\mathcal{S})) = L'_G$, and quite obviously,
we can deduce $\mathcal{F}(\mathcal{S}) = L_G$. □

Example 6.4 Consider the pure splicing system

$$\mathcal{S} = (A,I,R)$$

with $A = \{a,b,c\}$, $I = c^*ab \cup c$, and with $R$ composed of the rules

$$r = \langle c|\varepsilon-b|a \rangle, \quad r' = \langle c|\varepsilon-b|c \rangle, \quad r'' = \langle a|a-b|b \rangle .$$

This splicing system generates the language
$\mathcal{F}(\mathcal{S}) = c(c \cup L)^+ L \cup \{c\}$, with
$L = \{a^n b^n \mid n \geq 1\}$.

For the construction of the grammar for $\mathcal{F}(\mathcal{S})$, we add the
completions of the rules $r$ and $r'$. We also tacitly discard useless variables.
Now, we observe that $I \cap aA^*b = ab$, $I \cap cA^*b = c^+ab$, $I \cap c = c$, and
that the other intersections are empty. Thus, the first group of derivation
rules of the grammar is the following.

    $S \to {}^aW^b \mid {}^cW^b \mid V^c$
    ${}^aW^b \to a\, {}^aB^b\, b$
    ${}^cW^b \to (c\, {}^cB^c)^*\, c\, {}^cB^a\, a\, {}^aB^b\, b$

We observe by inspection that there is no derivation rule starting with
${}^bW^x$, ${}^xW^a$ or ${}^xW^c$, for $x \in A$, and similarly for $V^a$, $V^b$. This
leaves only the following second group of rules.

    ${}^aB^b \to {}^aB^a\, {}^aW^b\, {}^bB^b$
    ${}^cB^a \to {}^cB^a\, {}^aW^b\, {}^bB^a$
    ${}^cB^a \to {}^cB^c\, {}^cW^b\, {}^bB^a$
    ${}^cB^c \to {}^cB^a\, {}^aW^b\, {}^bB^c$
    ${}^cB^c \to {}^cB^c\, {}^cW^b\, {}^bB^c$

When looking for the final grammar, we may observe that the variables ${}^bB^x$
for $x \in A$, and ${}^aB^a$ only produce the empty word. So they can be replaced
by $\varepsilon$ everywhere in the grammar. It follows that ${}^aB^b$ can be
replaced by ${}^aW^b$. Also, it is easily seen that ${}^cB^a$ and ${}^cB^c$ generate
the same language. This leads to the following grammar, where we write, for
easier reading, $X$ for ${}^aW^b$, $Y$ for ${}^cW^b$, and $T$ for ${}^cB^a$.

    $S \to X \mid Y \mid c$
    $X \to aXb \mid ab$
    $Y \to (cT)^+ X$
    $T \to TX \mid TY \mid \varepsilon$

It is easily checked that this generalized context-free grammar indeed generates
the language $\mathcal{F}(\mathcal{S}) = c(c \cup L)^+ L \cup \{c\}$, with
$L = \{a^n b^n \mid n \geq 1\}$.

7. Concatenation systems

We introduce a classification of the productions generated in a splicing system
by defining two kinds of productions called proper insertions and
concatenations.

Let $r = \langle \alpha|\gamma-\delta|\beta \rangle$ be a splicing rule. The
production $x\alpha \cdot \beta y, \gamma z\delta \vdash_r x\alpha \cdot \gamma z\delta \cdot \beta y$
is a proper insertion if $x\alpha \neq \varepsilon$ and $\beta y \neq \varepsilon$; it
is a concatenation otherwise. If $r$ is a pure rule, then its productions are
always proper insertions.

Of course, the rule $r$ can produce a concatenation only if $\beta = \varepsilon$ or
$\alpha = \varepsilon$. However, such rules can be used for both kinds of
productions. Consider for example the rule $r = \langle a|c-d|\varepsilon \rangle$.
Then the production $aa \cdot, cad \vdash_r aa \cdot cad$ is a concatenation, while
the production $a \cdot a, cad \vdash_r a \cdot cad \cdot a$ is a proper insertion.

We consider now rules which are not pure, and we restrict their usage to
concatenations. This leads to the notion of concatenation systems. We then show
that alphabetic context-free concatenation systems only generate context-free
languages.

7.1. Concatenation systems

A concatenation system is a triplet $\mathcal{T} = (A,I,R)$, where $A$ is an
alphabet, $I$ is a set of words over $A$, called the initial set, and $R$ is a
finite set of concatenation rules. A concatenation rule $r$ is a quadruplet of
words over $A$. It is denoted $r = [\alpha-\beta|\gamma-\delta]$, to emphasize the
special usage which is made of such a rule.

A concatenation rule $r = [\alpha-\beta|\gamma-\delta]$ can be applied to words $u$
and $v$ provided $u \in \alpha A^* \beta$ and $v \in \gamma A^* \delta$. Applying $r$ to
the pair $(u,v)$ gives the word $w = uv$. This is denoted by $u, v \models_r w$ and
is called a concatenation production.

The language generated by the system $\mathcal{T}$, denoted by
$\mathcal{K}(\mathcal{T})$, is the smallest language containing $I$ and closed
under the application of the rules of $R$.

Again, the system $\mathcal{T}$ is alphabetic if every rule in $R$ has handles of
length at most one. It is context-free if the initial set $I$ is context-free.
The notion of complete set is similar to the one for splicing rules.

7.2. Alphabetic concatenation

This section is devoted to the proof of the following theorem.

Theorem 7.1 The language generated by an alphabetic context-free concatenation
system is context-free.

Proof Let $\mathcal{T} = (A,I,R)$ be an alphabetic context-free concatenation
system. We suppose that the set $R$ is complete. Set $K = \mathcal{K}(\mathcal{T})$.

We construct a grammar $G = (T, V, S, R)$ and a substitution
$\sigma : T^* \to A^*$ for which we prove that $K = \sigma(L_G)$. The grammar is
quite similar to that built for Theorem 6.3. The grammar $G$ has the set of
terminal symbols $T = \{{}^aI^b \mid a, b \in A\} \cup \{I^a \mid a \in A\}$, and the set
of non-terminal symbols
$V = \{S\} \cup \{{}^aW^b \mid a, b \in A\} \cup \{V^a \mid a \in A\}$. The axiom is $S$.

As in the proof of Theorem 6.3, the purpose of the variables is the following.
The symbol ${}^aW^b$ is used to derive words of length at least 2 that start with
the letter $a$ and end with the letter $b$, that is the set $K \cap aA^*b$.
Similarly, the symbol $V^a$ will be used to derive the word $a$ if it is in $K$.
The terminal symbols ${}^aI^b$ (resp. $I^a$) are mapped to the sets $I \cap aA^*b$
(resp. $I \cap a$) by the context-free substitution $\sigma$ defined by:

$$\sigma({}^aI^b) = I \cap aA^*b ; \qquad \sigma(I^a) = I \cap a .$$

This substitution is context-free because the set $I$ is context-free.

The derivation rules of the grammar $G$ are divided into two groups. In the
following, $a$ and $b$ are any letters in $A$.

The first group contains derivation rules that separate words according to
their initial and final letters, and single out one-letter words:

    $S \to {}^aW^b$,
    $S \to V^a$,
    ${}^aW^b \to {}^aI^b$,
    $V^a \to I^a$.

The second group of rules deals with concatenations:

    ${}^aW^b \to {}^aW^c\, {}^dW^b$,  for $[a-c|d-b] \in R$,
    ${}^aW^b \to V^a\, {}^cW^b$,  for $[\varepsilon-a|c-b] \in R$ or $[a-\varepsilon|c-b] \in R$,
    ${}^aW^b \to {}^aW^c\, V^b$,  for $[a-c|\varepsilon-b] \in R$ or $[a-c|b-\varepsilon] \in R$,
    ${}^aW^b \to V^a\, V^b$,  for $[\varepsilon-a|\varepsilon-b] \in R$ or $[a-\varepsilon|\varepsilon-b] \in R$ or $[a-\varepsilon|b-\varepsilon] \in R$ or $[\varepsilon-a|b-\varepsilon] \in R$.

By construction, the language $L_G$ generated by $G$ is context-free, and by
Theorem 6.1, the language $\sigma(L_G)$ is also context-free.

We claim that $\sigma(L_G) = K$. We first prove the inclusion
$\sigma(L_G) \subseteq K$. For this, we show, by induction on the length of the
derivation in $G$, that for all letters $a, b \in A$, we have
$\sigma(L_G({}^aW^b)) \subseteq K \cap aA^*b$ and
$\sigma(L_G(V^a)) \subseteq K \cap a$.

The only terminal derivations of length 1 are

    ${}^aW^b \to {}^aI^b$, and one has $\sigma({}^aI^b) = I \cap aA^*b \subseteq K \cap aA^*b$,
    $V^a \to I^a$, and one has $\sigma(I^a) = I \cap a \subseteq K \cap a$.

Thus the inclusion holds in this case.

Assume that the hypotheses of induction are true for derivations of length less
than $k$ and let $u$ be a word obtained by a derivation of length $k$. Since
$k \geq 2$, the first derivation rule is one of the second group, and the
derivation has one of the forms

    ${}^aW^b \to {}^aW^c\, {}^dW^b \xrightarrow{*} u$
    ${}^aW^b \to V^a\, {}^cW^b \xrightarrow{*} u$
    ${}^aW^b \to {}^aW^c\, V^b \xrightarrow{*} u$
    ${}^aW^b \to V^a\, V^b \xrightarrow{*} u$

for some $a, b, c, d \in A$. In the first case, we have $u = u_1u_2$, with
${}^aW^c \xrightarrow{*} u_1$ and ${}^dW^b \xrightarrow{*} u_2$, both derivations
having length strictly less than $k$. By the inductive hypotheses,
$\sigma(u_1) \subseteq K \cap aA^*c$ and $\sigma(u_2) \subseteq K \cap dA^*b$.
Moreover, since ${}^aW^b \to {}^aW^c\, {}^dW^b$ is a derivation rule in $G$, one has
$[a-c|d-b] \in R$. This ensures that
$(K \cap aA^*c)(K \cap dA^*b) \subseteq K \cap aA^*b$. Consequently,
$\sigma(u) = \sigma(u_1)\sigma(u_2)$ is contained in $K \cap aA^*b$. The other cases
are similar. This proves the inclusion $\sigma(L_G) \subseteq K$.

We now prove the converse inclusion $K \subseteq \sigma(L_G)$. For this, we prove
that for all letters $a, b \in A$, we have
$K \cap aA^*b \subseteq \sigma(L_G({}^aW^b))$ and
$K \cap a \subseteq \sigma(L_G(V^a))$.

It is easy to see that if $a \in K$ then $a \in I$, $\sigma(I^a) = a$, and
$V^a \to I^a$. Thus $K \cap a \subseteq \sigma(L_G(V^a))$ for all letters $a$ in $A$.

The inclusions $K \cap aA^*b \subseteq \sigma(L_G({}^aW^b))$ are proved by induction
on the number of concatenation operations used. Let $u \in K \cap aA^*b$.

If $u$ is obtained without any concatenation, then
$u \in I \cap aA^*b = \sigma({}^aI^b)$, and since ${}^aW^b \to {}^aI^b$ is a derivation
rule in $G$, we have $u \in \sigma(L_G({}^aW^b))$.

Assume that the inductive hypothesis holds for words obtained by fewer than $k$
concatenations, and that $u$ is obtained by $k$ concatenations. Then there exist
two words $u_1$ and $u_2$ such that $u = u_1u_2$ and such that $u_1$ and $u_2$ are
obtained by fewer than $k$ concatenations. There are four cases to consider,
according to the concatenation rule used to produce $u$ from $u_1$ and $u_2$. The
cases are the following.

    $u_1 \in K \cap aA^*c$,  $u_2 \in K \cap dA^*b$,
    $u_1 \in K \cap aA^*c$,  $u_2 \in K \cap b$,
    $u_1 \in K \cap a$,  $u_2 \in K \cap dA^*b$,
    $u_1 \in K \cap a$,  $u_2 \in K \cap b$.

Consider the first case (the others are similar). Since $u \in K$, there is a
concatenation rule $[a-c|d-b]$ in $R$. Consequently, there exists in $G$ a
derivation rule ${}^aW^b \to {}^aW^c\, {}^dW^b$. By induction hypothesis, there is
a derivation ${}^aW^c \xrightarrow{*} v_1$ with $u_1 \in \sigma(v_1)$, and a
derivation ${}^dW^b \xrightarrow{*} v_2$ with $u_2 \in \sigma(v_2)$. It follows that

$${}^aW^b \to {}^aW^c\, {}^dW^b \xrightarrow{*} v_1v_2 ,$$

and since $u \in \sigma(v_1)\sigma(v_2) = \sigma(v_1v_2)$, one has
$u \in \sigma(L_G({}^aW^b))$. This proves the inclusion $K \subseteq \sigma(L_G)$,
and thus the claim. Since $\sigma(L_G)$ is context-free, the language $K$ is also
context-free. This completes the proof. □



Remark 7.2 Contrary to Theorem 6.3 which is false for systems which are not
alphabetic, Theorem 7.1 holds for concatenation systems without the requirement
that they are alphabetic. The proof is quite analogous to the alphabetic case.

Example 7.3 Consider the concatenation system $\mathcal{T} = (A,I,R)$ over the
alphabet $A = \{a,b,c\}$, with $I = \{ab, c\}$, and with $R$ composed of the
concatenation rules

    $[\varepsilon-c|\varepsilon-b]$,
    $[\varepsilon-c|x-b]$  for $x \in A$.

The completion of the system gives the concatenation rules

    $[\varepsilon-c|\varepsilon-b]$,
    $[\varepsilon-c|x-b]$  for $x \in A$,
    $[y-c|\varepsilon-b]$  for $y \in A$,
    $[y-c|x-b]$  for $x, y \in A$.

According to the construction of the previous proof, these concatenation rules
give the derivation rules

    ${}^cW^b \to V^c\, V^b$                               (7.1)
    ${}^cW^b \to V^c\, {}^xW^b$  for $x \in A$            (7.2)
    ${}^yW^b \to {}^yW^c\, V^b$  for $y \in A$            (7.3)
    ${}^yW^b \to {}^yW^c\, {}^xW^b$  for $x, y \in A$     (7.4)

The first group of derivation rules is composed only of

    $S \to {}^xW^y$  for $x, y \in A$
    $S \to V^c$
    ${}^aW^b \to {}^aI^b$
    $V^c \to I^c$

because of the set $I$ of initial words. Since there is no derivation rule
starting with $V^b$, the derivation rules (7.1) and (7.3) are useless and can be
removed. Similarly, there is no derivation rule starting with ${}^yW^c$, so the
derivation rules (7.4) can be removed. For the same reason, the variable
${}^bW^b$ can be removed. Finally, we get the grammar

    $S \to {}^aW^b \mid {}^cW^b \mid V^c$
    ${}^aW^b \to {}^aI^b$
    ${}^cW^b \to V^c\, {}^aW^b \mid V^c\, {}^cW^b$

and the substitution

    $\sigma({}^aI^b) = ab$
    $\sigma(I^c) = c$

The language obtained is $c^*ab + c$.
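The language of Example 7.3 can also be observed by closing the initial set under the concatenation rules, up to a length bound. The sketch below is an illustration only: the names `fits` and `concat_closure` and the bound `max_len` are assumptions introduced here, and a rule $[\alpha-\beta|\gamma-\delta]$ is encoded as the tuple `(alpha, beta, gamma, delta)`.

```python
def fits(word, left, right):
    """True when word belongs to left A* right."""
    return (len(word) >= len(left) + len(right)
            and word.startswith(left) and word.endswith(right))


def concat_closure(initial, rules, max_len):
    """Words of length at most max_len obtained by concatenation productions."""
    lang = set(initial)
    changed = True
    while changed:
        changed = False
        for u in list(lang):
            for v in list(lang):
                if len(u) + len(v) > max_len or u + v in lang:
                    continue
                # u, v |= uv is allowed when some rule [a-b|c-d] has u in aA*b, v in cA*d
                if any(fits(u, a, b) and fits(v, c, d) for a, b, c, d in rules):
                    lang.add(u + v)
                    changed = True
    return lang


rules = [("", "c", "", "b")] + [("", "c", x, "b") for x in "abc"]
print(sorted(concat_closure({"ab", "c"}, rules, 7), key=len))
```

Up to length 7, the words obtained are $c$ and $c^k ab$ for $0 \leq k \leq 5$, in agreement with $c^*ab + c$.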

Remark 7.4 The language $\mathcal{K}(\mathcal{T})$ generated by a concatenation
system $\mathcal{T} = (A,I,R)$ may not be regular, even if $I$ is finite. Consider
indeed the system given by $I = \{ab, a, b, c, d\}$ and

$$R = \{[\varepsilon-c|a-b],\ [c-b|d-\varepsilon],\ [\varepsilon-a|c-d],\ [a-d|b-\varepsilon]\} .$$

The language obtained is
$\mathcal{K}(\mathcal{T}) = L \cup cL \cup cLd \cup acLd$, where $L$ denotes the
language $L = \{(ac)^n ab (db)^n \mid n \geq 0\}$, and this language is not regular.

7.3. Heterogeneous systems

A splicing system is a heterogeneous system if all its rules are either pure
rules or concatenation rules.

The aim of heterogeneous systems is to separate the splicing rules according to
their usage. A pure rule is used for a proper insertion, that is for producing
a word $w = xvy$ from words $u = xy$ and $v$, with $x, y \neq \varepsilon$. On the
contrary, a concatenation rule produces the word $w = uv$ or $w = vu$, that is,
it handles precisely the case where $x = \varepsilon$ or $y = \varepsilon$.

The following proposition shows that for any flat alphabetic splicing system,
there is an alphabetic heterogeneous system with the same initial set $I$ which
generates the same language.

Proposition 7.5 Let $\mathcal{S} = (A,I,R)$ be a complete alphabetic splicing
system, and let $\mathcal{S}' = (A, I, R' \cup R'')$ be the heterogeneous system with
the same initial set $I$, where $R'$ is the set of pure rules of $R$, and

$$R'' = \{[\varepsilon-\alpha|\gamma-\delta] \mid \langle \alpha|\gamma-\delta|\varepsilon \rangle \in R\} \cup \{[\gamma-\delta|\beta-\varepsilon] \mid \langle \varepsilon|\gamma-\delta|\beta \rangle \in R\} .$$

Then $\mathcal{S}$ and $\mathcal{S}'$ generate the same language.

Proof The verification is left to the reader. □

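Example 7.6 Let $\mathcal{S}$ be the flat splicing system $(A,I,R)$ with
$A = \{a,b,c\}$, $I = \{ab, c\}$, and
$R = \{\langle a|a-b|b \rangle, \langle c|\varepsilon-b|\varepsilon \rangle\}$.

We complete $R$. The complete set of rules for $R$ is

    $\langle a|a-b|b \rangle$
    $\langle c|\varepsilon-b|\varepsilon \rangle$
    $\langle c|x-b|y \rangle$  for $x, y \in \{a,b,c\}$
    $\langle c|x-b|\varepsilon \rangle$  for $x \in \{a,b,c\}$
    $\langle c|\varepsilon-b|x \rangle$  for $x \in \{a,b,c\}$

The heterogeneous system $\mathcal{S}'$ corresponding to $\mathcal{S}$ is the system
$\mathcal{S}' = (A,I,R')$ where $R'$ is composed of the pure rules

    $\langle a|a-b|b \rangle$
    $\langle c|x-b|y \rangle$  for $x, y \in \{a,b,c\}$
    $\langle c|\varepsilon-b|x \rangle$  for $x \in \{a,b,c\}$

and of the concatenation rules

    $[\varepsilon-c|x-b]$  for $x \in \{a,b,c\}$
    $[\varepsilon-c|\varepsilon-b]$

which, after completion, give the concatenation rules of Example 7.3.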



7.4. Weak commutation of concatenations and proper insertions

Given a heterogeneous system $\mathcal{S} = (A,I,R)$, a production sequence is a
sequence $[\pi_1; \pi_2; \ldots; \pi_n]$ of productions such that, setting

$$\pi_k = (u_k, v_k \vdash_{r_k} w_k) \quad \text{for } 1 \leq k \leq n,$$

each $u_k$ and $v_k$ is either an element of $I$, or is equal to one of the words
$u_1, v_1, w_1, \ldots, u_{k-1}, v_{k-1}, w_{k-1}$. The word $w_n$ is the result of
the production sequence. The length of the sequence is $n$. By convention, there
is a production sequence of length 0 for each $w \in I$, denoted by $[w]$. Its
result is $w$.

Example 7.7 Consider the pure system over $A = \{a,b\}$ with initial set
$I = \{ab\}$ and the unique splicing rule $r = \langle a|a-b|b \rangle$. In this
system, the only splicing sequence of length 0 is $[ab]$. Both production
sequences (we omit the reference to $r$)

$$[ab, ab \vdash a^2b^2;\ a^2b^2, a^2b^2 \vdash a^4b^4]$$

and

$$[ab, ab \vdash a^2b^2;\ ab, a^2b^2 \vdash a^3b^3;\ a^3b^3, ab \vdash a^4b^4]$$

have the same result $a^4b^4$.
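The two production sequences of Example 7.7 can be replayed step by step. The sketch below is illustrative only: the names `splice` and `result_of`, and the encoding of a production $u, v \vdash w$ as a triple `(u, v, w)` and of the rule $\langle \alpha|\gamma-\delta|\beta \rangle$ as `(alpha, gamma, delta, beta)`, are assumptions introduced here.

```python
def splice(u, v, rule):
    """Words obtained by inserting v into u with the flat rule <alpha|gamma-delta|beta>."""
    alpha, gamma, delta, beta = rule
    if len(v) < len(gamma) + len(delta):
        return set()
    if not (v.startswith(gamma) and v.endswith(delta)):
        return set()
    return {u[:i] + v + u[i:] for i in range(len(u) + 1)
            if u[:i].endswith(alpha) and u[i:].startswith(beta)}


def result_of(initial, rule, productions):
    """Check a production sequence and return its result."""
    available = set(initial)
    result = None
    for u, v, w in productions:
        if u not in available or v not in available:
            raise ValueError("inputs must be initial words or earlier words")
        if w not in splice(u, v, rule):
            raise ValueError("w is not produced from u and v by the rule")
        available.add(w)
        result = w
    return result


def ab(n):
    return "a" * n + "b" * n


r = ("a", "a", "b", "b")
seq1 = [(ab(1), ab(1), ab(2)), (ab(2), ab(2), ab(4))]
seq2 = [(ab(1), ab(1), ab(2)), (ab(1), ab(2), ab(3)), (ab(3), ab(1), ab(4))]
print(result_of({"ab"}, r, seq1), result_of({"ab"}, r, seq2))
```

Both sequences are accepted and end with the word $a^4b^4$, although they have different lengths.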

Clearly, the language $\mathcal{F}(\mathcal{S})$ generated by a heterogeneous
system $\mathcal{S}$ is the set of the results of all its production sequences.

We show that, in an alphabetic splicing system, one can always choose a
particular type of production sequence for the computation of a word, namely a
sequence where the concatenations are performed before proper insertions. This
is stated in the following lemma.

Lemma 7.8 Let $\mathcal{S} = (A,I,R)$ be an alphabetic heterogeneous splicing
system. Given a sequence of proper insertions and concatenation productions with
result $u$, there exists another sequence with the same result $u$, using the
same rules of proper insertions and concatenations, and such that all
concatenation productions occur before any proper insertion production.

Proof Let $r_1 = \langle \alpha|\gamma-\delta|\beta \rangle$ be a pure rule and let
$r_2 = [\zeta-\eta|\mu-\nu]$ be a concatenation rule, and assume that there is a
production sequence $\sigma = [\pi_1; \pi_2]$, with

$$\pi_1 = (u, v \vdash_{r_1} w), \qquad \pi_2 = (p, s \models_{r_2} t) ,$$

where $u, v, w, p, s, t$ are words and $t = ps$. We assume that $u, v, p, s$ are all
non-empty. If neither $p$ nor $s$ is equal to $w$, we replace the sequence
$[\pi_1; \pi_2]$ by $[\pi_2; \pi_1]$, and we get the result.

Assume now that $p = w$ or $s = w$. Since $\pi_1$ is a proper insertion, there
exists a factorization $u = u_1 \cdot u_2$, with $u_1$ and $u_2$ non-empty words,
such that $w = u_1 \cdot v \cdot u_2$, and the production $\pi_1$ can be rewritten
as $\pi_1 = (u_1 \cdot u_2, v \vdash_{r_1} u_1 \cdot v \cdot u_2)$.

There are two cases to be considered.

1. $p = w$, $s \neq w$ (or the symmetric case $p \neq w$, $s = w$);

2. $p = s = w$.

Case 1: In this case, we have

$$\pi_1 = (u_1 \cdot u_2, v \vdash_{r_1} u_1 \cdot v \cdot u_2), \qquad \pi_2 = (u_1vu_2, s \models_{r_2} u_1vu_2 \cdot s) ,$$

and, in view of the production $\pi_2$, one has $u_1 \in \zeta A^*$ and
$u_2 \in A^*\eta$ (here we use the fact that $u_1$ and $u_2$ are non-empty, and that
$\zeta$ and $\eta$ have at most one letter).

We replace the sequence $[\pi_1; \pi_2]$ by $[\pi_3; \pi_4]$, where

$$\pi_3 = (u_1u_2, s \models_{r_2} u_1u_2 \cdot s), \qquad \pi_4 = (u_1 \cdot u_2s, v \vdash_{r_1} t = u_1 \cdot v \cdot u_2s) .$$

The concatenation $\pi_3$ is valid because $u_1u_2 \in \zeta A^* \eta$ and
$s \in \mu A^* \nu$.

Case 2: In this case,

$$\pi_1 = (u_1 \cdot u_2, v \vdash_{r_1} u_1 \cdot v \cdot u_2) ,$$
$$\pi_2 = (u_1vu_2, u_1vu_2 \models_{r_2} u_1vu_2 \cdot u_1vu_2) ,$$

and the sequence $[\pi_1; \pi_2]$ is replaced by $[\pi_3; \pi_4; \pi_5; \pi_1]$, where

$$\pi_3 = (u_1u_2, u_1u_2 \models_{r_2} u_1u_2 \cdot u_1u_2) ,$$
$$\pi_4 = (u_1 \cdot u_2u_1u_2, v \vdash_{r_1} u_1 \cdot v \cdot u_2u_1u_2) ,$$
$$\pi_5 = (u_1vu_2u_1 \cdot u_2, v \vdash_{r_1} u_1vu_2u_1 \cdot v \cdot u_2) .$$

This proves the lemma. □

Remark 7.9 The lemma does not hold anymore if the rules are not alphabetic.
Consider for example the rules $r_1 = \langle a|x-y|b \rangle$ and
$r_2 = [z-t|ax-\varepsilon]$. Then the splicing sequence

$$[a \cdot b, xy \vdash_{r_1} a \cdot xy \cdot b;\ zt, axyb \models_{r_2} zt \cdot axyb]$$

cannot be replaced by a sequence where proper insertions occur after
concatenations.

The following proposition is an immediate corollary of the previous lemma.

Proposition 7.10 For any language $L$ generated by an alphabetic heterogeneous
system $\mathcal{S} = (A,I,R)$, there exist a set of alphabetic concatenation rules
$R'$ and a set of pure alphabetic rules $R''$, such that

$$L = \mathcal{F}((A, \mathcal{K}((A,I,R')), R'')) .$$

The combination of Theorems 7.1 and 6.3 gives the following theorem, which is
our main theorem in the case of a flat system.

Theorem 7.11 Let $\mathcal{S} = (A,I,R)$ be a flat alphabetic context-free splicing
system. Then $\mathcal{F}(\mathcal{S})$ is context-free.

Proof By Proposition 7.10,
$\mathcal{F}(\mathcal{S}) = \mathcal{F}((A, \mathcal{K}((A,I,R')), R''))$. The language
$L = \mathcal{K}((A,I,R'))$ is context-free in view of Theorem 7.1. The language
$\mathcal{F}((A,L,R''))$ is context-free by Theorem 6.3. Thus
$\mathcal{F}(\mathcal{S})$ is context-free. □



Example 7.12 Consider again the splicing system $\mathcal{S} = (A,I,R)$ with
$A = \{a,b,c\}$, $I = \{ab, c\}$, and
$R = \{\langle a|a-b|b \rangle, \langle c|\varepsilon-b|\varepsilon \rangle\}$. The
heterogeneous system corresponding to $\mathcal{S}$ is given in Example 7.6. The
associated concatenation system $\mathcal{T} = (A,I,R')$ has the concatenation
rules $R'$ composed of

    $[\varepsilon-c|x-b]$  for $x \in \{a,b,c\}$
    $[\varepsilon-c|\varepsilon-b]$

and we have seen in Example 7.3 that it generates the language
$\mathcal{K}(\mathcal{T}) = c^*ab \cup c$. The pure system has the set $R''$ of rules
consisting of

    $\langle a|a-b|b \rangle$
    $\langle c|x-b|y \rangle$  for $x, y \in \{a,b,c\}$
    $\langle c|\varepsilon-b|x \rangle$  for $x \in \{a,b,c\}$

As seen in Example 6.4, it generates the context-free language
$\mathcal{F}(\mathcal{S}) = c(c \cup L)^+ L \cup \{c\}$, with
$L = \{a^n b^n \mid n \geq 1\}$.



8. Circular splicing

Recall that a circular splicing system $\mathcal{S} = (A,I,R)$ is composed of an
alphabet $A$, an initial set $I$ of circular words, and a finite set $R$ of rules.
A rule $r = \langle \alpha|\gamma-\delta|\beta \rangle$ is applied to two circular
words ${}^{\sim}u$ and ${}^{\sim}v$, provided there exist words $x, y$ such that
$u \sim \beta x \alpha$ and $v \sim \gamma y \delta$, and produces the circular word
${}^{\sim}\beta x \alpha \gamma y \delta$.

Example 8.1 Consider the circular splicing system over $A = \{a,b\}$, with
initial set $I = \{{}^{\sim}ab\}$ and with the single rule
$\langle a|a-b|b \rangle$. The rule expresses the fact that a word starting with
the letter $a$ and ending with a letter $b$ can be inserted, in a circular word,
between a letter $a$ followed by a letter $b$. As a consequence, the set
generated by the system is the set ${}^{\sim}\{a^n b^n \mid n \geq 1\}$.
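Example 8.1 can be explored on short circular words by representing each circular word by a canonical rotation. The sketch below is an illustration under stated assumptions: the names `canon`, `rotations`, `circular_splice` and `circular_closure`, the choice of the lexicographically least rotation as representative, and the length bound `max_len` are all introduced here and are not taken from the paper.

```python
def rotations(word):
    """All linear words in the conjugacy class of `word`."""
    return {word[i:] + word[:i] for i in range(len(word))}


def canon(word):
    """Canonical representative of a circular word: its least rotation."""
    return min(rotations(word))


def fits(word, left, right):
    """True when word belongs to left A* right."""
    return (len(word) >= len(left) + len(right)
            and word.startswith(left) and word.endswith(right))


def circular_splice(u, v, rule):
    """Circular words produced from ~u and ~v by the rule <alpha|gamma-delta|beta>."""
    alpha, gamma, delta, beta = rule
    out = set()
    for p in rotations(u):                 # p plays the role of beta x alpha
        if fits(p, beta, alpha):
            for q in rotations(v):         # q plays the role of gamma y delta
                if fits(q, gamma, delta):
                    out.add(canon(p + q))
    return out


def circular_closure(initial, rules, max_len):
    lang = {canon(w) for w in initial}
    changed = True
    while changed:
        changed = False
        for u in list(lang):
            for v in list(lang):
                if len(u) + len(v) > max_len:
                    continue
                for rule in rules:
                    new = circular_splice(u, v, rule) - lang
                    if new:
                        lang |= new
                        changed = True
    return lang


print(sorted(circular_closure({"ab"}, [("a", "a", "b", "b")], 8), key=len))
```

Up to length 8, the representatives obtained are $ab$, $a^2b^2$, $a^3b^3$ and $a^4b^4$, in agreement with the set ${}^{\sim}\{a^n b^n \mid n \geq 1\}$.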
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We now show, on this example, that an alphabetic circular splicing sys- 
tem, operating on circular words and generating a circular language, can 
always be simulated by a flat heterogeneous splicing system. This system 
has the same initial set (up to full linearization), but has an augmented 
set of rules, obtained by a kind of conjugacy of the splicing rules. To be 
more precise, we introduce the following notation. Given an alphabetic rule
$r = \langle \alpha|\gamma-\delta|\beta \rangle$, we denote by ${}^{\sim}r$ the set

$${}^{\sim}r = \{\langle \alpha|\gamma-\delta|\beta \rangle,\ \langle \delta|\beta-\alpha|\gamma \rangle,\ [\beta-\alpha|\gamma-\delta],\ [\gamma-\delta|\beta-\alpha]\} .$$

The rules of the flat splicing system simulating the circular system are the
sets ${}^{\sim}r$, for all rules $r$ of the circular system. We illustrate the
construction on the previous example.

Example 8.2 Consider the flat splicing system over $A = \{a, b\}$, with initial set $I = \{ba\}$ and with the single rule $\langle a|a-b|b\rangle$. Clearly, the rule cannot be applied, and consequently the language generated by the system reduces to $I$.

In the world of circular words, the system is transformed into a heterogeneous system as follows.

(1) The initial set is now the circular class of $I$, namely the set ${}^\sim I = \{ab, ba\}$.

(2) The rule $r = \langle a|a-b|b\rangle$ is replaced by ${}^\sim r$; this gives, by conjugacy, one new pure rule $\langle b|b-a|a\rangle$ and two concatenation rules $[a-b|b-a]$ and $[b-a|a-b]$.

The use of only the concatenation rules produces the set $\{ab, ba, abba, baab\}$. Note that this set is not closed under conjugacy. Then, the repeated application of the two pure rules produces the set

$$\{a^n b^{n+m} a^m \mid n + m > 0\} \cup \{b^n a^{n+m} b^m \mid n + m > 0\}.$$

This set is now closed under conjugacy; it is the language generated with the four flat rules. Moreover, it is exactly the linearization of the set of circular words ${}^\sim\{a^n b^n \mid n \ge 1\}$ generated by the circular splicing system.
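The claims of Example 8.2 can be tested on short words with a small Python sketch. It is only an illustration, and it rests on assumptions not stated here: handles are single letters, a pure rule $\langle\alpha|\gamma-\delta|\beta\rangle$ inserts a word of the form $\gamma z\delta$ between $\alpha$ and $\beta$, a concatenation rule $[\beta-\alpha|\gamma-\delta]$ concatenates a word $\beta x\alpha$ with a word $\gamma y\delta$, and the closure is truncated at a length bound. All function names are hypothetical.

```python
def tilde_r(rule):
    """Split ~r into its two pure rules and its two concatenation rules."""
    alpha, gamma, delta, beta = rule
    pure = [(alpha, gamma, delta, beta), (delta, beta, alpha, gamma)]
    concat = [(beta, alpha, gamma, delta), (gamma, delta, beta, alpha)]
    return pure, concat

def apply_pure(u, v, rule):
    """<alpha|gamma-delta|beta>: insert v = gamma z delta between alpha and beta in u."""
    alpha, gamma, delta, beta = rule
    if len(v) < 2 or v[0] != gamma or v[-1] != delta:
        return set()
    return {u[:i + 1] + v + u[i + 1:]
            for i in range(len(u) - 1)
            if u[i] == alpha and u[i + 1] == beta}

def apply_concat(u, v, rule):
    """[b1-a1|g1-d1]: concatenate u = b1 x a1 with v = g1 y d1."""
    b1, a1, g1, d1 = rule
    ok = (len(u) >= 2 and len(v) >= 2 and u[0] == b1 and u[-1] == a1
          and v[0] == g1 and v[-1] == d1)
    return {u + v} if ok else set()

def flat_closure(initial, pure, concat, max_len):
    """Words of length <= max_len generated from `initial` by the given rules."""
    lang = set(initial)
    changed = True
    while changed:
        changed = False
        for u in list(lang):
            for v in list(lang):
                new = set()
                for r in pure:
                    new |= apply_pure(u, v, r)
                for r in concat:
                    new |= apply_concat(u, v, r)
                for w in new:
                    if len(w) <= max_len and w not in lang:
                        lang.add(w)
                        changed = True
    return lang

def conjugates(w):
    return {w[i:] + w[:i] for i in range(len(w))}

# Rule <a|a-b|b> stored as (alpha, gamma, delta, beta); initial set {ab, ba}.
pure, concat = tilde_r(("a", "a", "b", "b"))
only_concat = flat_closure({"ab", "ba"}, [], concat, max_len=8)
full = flat_closure({"ab", "ba"}, pure, concat, max_len=8)
linearization = set().union(*(conjugates("a" * n + "b" * n) for n in range(1, 5)))
```

Up to length 8, `only_concat` is the four-word set obtained from the concatenation rules alone, and `full` coincides with the set of all conjugates of the words $a^n b^n$, in agreement with the example.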

We prove the following result, which shows that the example holds in the general case.

Proposition 8.3 Let $S = (A, I, R)$ be a circular alphabetic splicing system, and let $S' = (A, \mathrm{Lin}(I), R')$ be the flat heterogeneous splicing system defined by $R' = \bigcup_{r \in R} {}^\sim r$. Then $\mathrm{Lin}(\mathcal{C}(S)) = \mathcal{F}(S')$.
Proof We prove first the inclusion $\mathrm{Lin}(\mathcal{C}(S)) \subseteq \mathcal{F}(S')$. For this, suppose that a rule $r = \langle\alpha|\gamma-\delta|\beta\rangle$ is applied, in the circular system $S$, to two circular words ${}^\sim u$ and ${}^\sim v$. There exist words $x, y$ such that $u \sim \beta x\alpha$ and $v \sim \gamma y\delta$. The circular word that is produced is ${}^\sim w$ with $w = \beta x\alpha\gamma y\delta$. We assume that all words in ${}^\sim u$, ${}^\sim v$ are in $\mathcal{F}(S')$ and we have to show that any word in ${}^\sim w$ is in $\mathcal{F}(S')$, by the use of the rules in ${}^\sim r$. First, $w$ is obtained, in $S'$, from $\beta x\alpha$ and $\gamma y\delta$ by the concatenation rule $[\beta-\alpha|\gamma-\delta]$, so $w \in \mathcal{F}(S')$. Next, if $z \sim w$ and $z \ne w$, then $z = st$ and $w = ts$ for some nonempty words $s, t$.

If $t$ is a prefix of $\beta x$, then there is a factorization $x = x'x''$ such that $t = \beta x'$, $s = x''\alpha\gamma y\delta$. Consequently, $z = x''\alpha\gamma y\delta\beta x'$, showing that $z$ is obtained, in the system $S'$, from $x''\alpha\beta x'$ and $\gamma y\delta$ by the rule $r$. Since $x''\alpha\beta x' \sim u$, it follows that $z \in \mathcal{F}(S')$.

If $t = \beta x\alpha$, then $s = \gamma y\delta$ and $z = \gamma y\delta\beta x\alpha$. In this case, $z$ is obtained by the concatenation rule $[\gamma-\delta|\beta-\alpha]$.

Finally, if $\beta x\alpha\gamma$ is a prefix of $t$, then there is a factorization $y = y'y''$ such that $t = \beta x\alpha\gamma y'$ and $s = y''\delta$. Consequently, $z = y''\delta\beta x\alpha\gamma y'$, showing that $z$ is obtained from $y''\delta\gamma y'$ and $\beta x\alpha$ by the rule $\langle\delta|\beta-\alpha|\gamma\rangle$. Since $y''\delta\gamma y' \sim v$, it follows again that $z \in \mathcal{F}(S')$.

The converse inclusion is shown very similarly. □

The proof of the proposition relies heavily on the fact that the system is alphabetic.
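To make the case analysis of the proof concrete, here is how its three cases play out on the rule of Example 8.2. This is an illustration worked out from the definitions, not part of the original argument. Take $r = \langle a|a-b|b\rangle$, so that $\alpha = \gamma = a$ and $\delta = \beta = b$, and choose $x = y = \varepsilon$. Then $\beta x\alpha = ba$, $\gamma y\delta = ab$ and $w = baab$, which is obtained by the concatenation rule $[b-a|a-b]$. The three remaining conjugates $z = st$ of $w = ts$ are covered as follows.

```latex
\begin{align*}
t = b,\ s = aab   &: \quad z = aabb \text{ is obtained from } ab \text{ and } ab
                     \text{ by the pure rule } \langle a|a-b|b\rangle, \\
t = ba,\ s = ab   &: \quad z = abba \text{ is obtained from } ab \text{ and } ba
                     \text{ by the concatenation rule } [a-b|b-a], \\
t = baa,\ s = b   &: \quad z = bbaa \text{ is obtained from } ba \text{ and } ba
                     \text{ by the pure rule } \langle b|b-a|a\rangle.
\end{align*}
```

Each line uses exactly one of the four rules of ${}^\sim r$, which is why all four are needed to close the flat language under conjugacy.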

As a consequence of the proposition, we obtain the following theorem, which is our main theorem in the circular case.

Theorem 8.4 Let $S = (A, I, R)$ be a circular alphabetic context-free splicing system. Then $\mathrm{Lin}(\mathcal{C}(S))$ is a context-free language.



Proof By Proposition 8.3, $\mathrm{Lin}(\mathcal{C}(S)) = \mathcal{F}(S')$, where $S' = (A, \mathrm{Lin}(I), R')$ is the flat heterogeneous splicing system defined by $R' = \bigcup_{r \in R} {}^\sim r$. Since $I$ is context-free, the language $\mathrm{Lin}(I)$ is context-free. By Theorem 7.1, the language generated by $S'$ is context-free. □
9. Appendix: Substitution theorems for context-free languages



For the sake of completeness, we give here a sketch of the proof of Theorem 5.2, together with an example. The proof is based on two lemmas. The first deals with the case of a generalized context-free grammar with a single non-terminal symbol, and the second shows how to reduce the number of non-terminal symbols in the general case.

Lemma 9.1 Let $G = (A, \{S\}, S, R)$ be a generalized context-free grammar with a single non-terminal symbol $S$. The language generated by $G$ is context-free.

Sketch of proof Let $L$ be the language generated by $G$ and set $M_S = \{m \mid S \to m \in R\}$. Let $H = (A \cup \{S\}, V, X, P)$, $S \notin V$, be a usual context-free grammar that generates $M_S$. The language $L$ is generated by the usual context-free grammar $G' = (A, V \cup \{S\}, S, P \cup \{S \to X\})$. □



Example 9.2 Let $G = (A, \{S\}, S, R)$ be the grammar with a single non-terminal symbol $S$ and with

$$R = \{S \to a \mid S\,b^n (c^*d)^n,\ n \ge 1\}.$$

According to the sketch of the proof given above, we define a grammar $H = (A \cup \{S\}, \{X, Y, Z\}, X, P)$, with

$$P = \begin{cases} X \to a \mid SY \\ Y \to bYZ \mid bZ \\ Z \to cZ \mid d \end{cases}$$

We can check that $H$ generates the language $M_S = \{a\} \cup \{S\,b^n (c^*d)^n \mid n \ge 1\}$. We now define $G' = (A, V \cup \{S\}, S, P')$, where $V = \{X, Y, Z\}$, with

$$P' = \begin{cases} S \to X \\ X \to a \mid SY \\ Y \to bYZ \mid bZ \\ Z \to cZ \mid d \end{cases}$$

The grammar $G'$ generates the same language as $G$, that is, the language

$$\{a\, b^{n_1}(c^*d)^{n_1} b^{n_2}(c^*d)^{n_2} \cdots b^{n_k}(c^*d)^{n_k} \mid k \ge 0,\ n_i \ge 1,\ 1 \le i \le k\}.$$
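The description of the language of $G'$ can be cross-checked by brute force on short words. The Python sketch below is an illustration under stated assumptions: productions are encoded as strings with upper-case letters as non-terminals, words are enumerated by leftmost derivations up to a length bound, and the membership test `in_language` is a hand-written reading of the set above, with $k = 0$ giving the word $a$. The names `PROD`, `generate` and `in_language` are hypothetical.

```python
from itertools import product

# Productions of G' (upper-case letters are non-terminals).
PROD = {
    "S": ["X"],
    "X": ["a", "SY"],
    "Y": ["bYZ", "bZ"],
    "Z": ["cZ", "d"],
}

def generate(max_len):
    """Terminal words of length <= max_len derivable from S by leftmost derivations."""
    seen, todo, words = {"S"}, ["S"], set()
    while todo:
        form = todo.pop()
        i = next((k for k, ch in enumerate(form) if ch in PROD), None)
        if i is None:                     # no non-terminal left: a terminal word
            words.add(form)
            continue
        for rhs in PROD[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # Every symbol derives at least one letter, so longer forms can be dropped.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                todo.append(new)
    return words

def in_language(w):
    """Check w = a b^{n1}(c*d)^{n1} ... b^{nk}(c*d)^{nk} with k >= 0 and all n_i >= 1."""
    if not w.startswith("a"):
        return False
    i = 1
    while i < len(w):
        n = 0
        while i < len(w) and w[i] == "b":   # read the block b^n
            n, i = n + 1, i + 1
        if n == 0:
            return False
        for _ in range(n):                  # read n factors from c*d
            while i < len(w) and w[i] == "c":
                i += 1
            if i == len(w) or w[i] != "d":
                return False
            i += 1
    return True

MAX_LEN = 6
generated = generate(MAX_LEN)
described = {"".join(p)
             for n in range(1, MAX_LEN + 1)
             for p in product("abcd", repeat=n)
             if in_language("".join(p))}
```

Comparing `generated` with `described` checks, for all words of length at most `MAX_LEN`, that the grammar and the set description agree; it is a finite sanity check, not a proof.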

The second lemma below tells us that we can reduce the problem to grammars with a single non-terminal symbol.
Lemma 9.3 Let $G$ be a generalized context-free grammar with at least two non-terminal symbols. There is a generalized context-free grammar with a single non-terminal symbol which generates the same language as $G$ does.

Sketch of proof Let $G = (A, V, S, R)$ with $V$ of cardinality at least 2. Let $X \in V$, with $X \ne S$. Define a grammar $G_X$ with one non-terminal symbol by $G_X = (A \cup V \setminus \{X\}, \{X\}, X, R_X)$ with $R_X = \{X \to m \mid X \to m \in R\}$. Let $M_X$ be the language generated by $G_X$. The language $M_X$ is context-free by Lemma 9.1.

Define the substitution $\sigma_X$ on $A \cup V$ by

$$\sigma_X(a) = \begin{cases} M_X & \text{if } a = X, \\ \{a\} & \text{otherwise.} \end{cases}$$

Define now the grammar $H = (A, V \setminus \{X\}, S, P)$ with $P = \{v \to \sigma_X(m) \mid v \to m \in R,\ v \in V \setminus \{X\}\}$.

The grammar $H$ generates the same language as $G$ does, and it has one variable fewer than $G$.

Now it suffices to iterate the process in order to obtain a grammar with one non-terminal symbol. □

Example 9.4 Let $G = (A, V, S, R)$ be the generalized grammar defined by $A = \{a, b, c, d\}$, $V = \{S, T, U\}$, and

$$R = \begin{cases} S \to ST \mid a \\ T \to b^n U^n,\ n \ge 1 \\ U \to cU \mid d \end{cases}$$

In the first step, we choose to remove the non-terminal $T$. Following the sketch of the proof given above, we define a grammar $G_T$ with a single non-terminal symbol $T$ by $G_T = (A \cup \{S, U\}, \{T\}, T, \{T \to b^n U^n,\ n \ge 1\})$. Let $M_T$ be the language generated by $G_T$. Clearly, $M_T = \{b^n U^n \mid n \ge 1\}$ and $M_T$ is context-free.

Next, we define a substitution $\sigma_T$ on $A \cup V$ by

$$\sigma_T(a) = \begin{cases} M_T & \text{if } a = T, \\ \{a\} & \text{otherwise,} \end{cases}$$

and the grammar $H = (A, \{S, U\}, S, P)$ with two non-terminal symbols $S, U$ by $P = \{v \to \sigma_T(m) \mid v \to m \in R,\ v \in V \setminus \{T\}\}$, i.e.

$$P = \begin{cases} S \to a \mid S\,b^n U^n,\ n \ge 1 \\ U \to cU \mid d \end{cases}$$

The grammars $H$ and $G$ generate the same language.

To obtain a grammar with only one variable, we iterate the process by eliminating the variable $U$ from $H$, and we obtain the grammar $H' = (A, \{S\}, S, P')$ with the unique variable $S$ and with

$$P' = \{S \to a \mid S\,b^n (c^*d)^n,\ n \ge 1\}.$$

This is the grammar $G$ of Example 9.2 above.
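As a small worked illustration (not taken from the original text), the word $abcd$ can be derived in each of the three grammars of this example. Every elimination step replaces a chain of derivation steps through the removed variable by a single application of a rule with a substituted right-hand side.

```latex
\begin{align*}
\text{in } G  &: \quad S \Rightarrow ST \Rightarrow aT \Rightarrow abU \Rightarrow abcU \Rightarrow abcd, \\
\text{in } H  &: \quad S \Rightarrow SbU \Rightarrow abU \Rightarrow abcU \Rightarrow abcd, \\
\text{in } H' &: \quad S \Rightarrow Sbcd \Rightarrow abcd .
\end{align*}
```

Here the step $S \Rightarrow SbU$ in $H$ uses the right-hand side $S\,b^1 U^1$, and the step $S \Rightarrow Sbcd$ in $H'$ uses the right-hand side $S\,b^1 w$ with $w = cd \in (c^*d)^1$.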